Abstract: This paper develops a quasi-Bayesian analysis of the nonparametric instrumental variables model, with a focus on the asymptotic properties of quasi-posterior distributions. Instead of imposing a distributional assumption on the data generating process, we consider a quasi-likelihood induced from the conditional moment restriction, and put priors on the function-valued parameter. We call the resulting posterior the quasi-posterior, which corresponds to the "Gibbs posterior" in the literature. Here we use sieve priors, which are prior distributions that concentrate on finite dimensional sieve spaces. The dimension of the sieve space should increase with the sample size. We derive rates of contraction and a nonparametric Bernstein-von Mises type result for the quasi-posterior distribution, and rates of convergence for the quasi-Bayes estimator defined by the posterior expectation. We show that, with priors suitably chosen, the quasi-posterior distribution (the quasi-Bayes estimator) attains the minimax optimal rate of contraction (convergence, respectively). These results greatly sharpen the previous related work.
1. Introduction
1.1. Overview 

Let $(Y, X, W)$ be a triplet of scalar random variables, where $Y$ is a dependent variable, $X$ is an endogenous variable and $W$ is an instrumental variable. Without losing much generality, we assume that the support of $(X, W)$ is contained in $[0,1]^2$. The support of $Y$ may be unbounded. We consider the nonparametric instrumental variables (NPIV) model of the form

$$E[Y \mid W] = E[g_0(X) \mid W], \tag{1}$$
where $g_0 : [0,1] \to \mathbb{R}$ is an unknown structural function of interest. If we define $U = Y - g_0(X)$, (1) reduces to the conventional form

$$Y = g_0(X) + U, \quad E[U \mid W] = 0.$$

Here $X$ is potentially correlated with $U$ and hence $E[U \mid X] \neq 0$.

Suppose that $(X, W)$ has square-integrable joint density $f_{X,W}(x, w)$ on $[0,1]^2$ and denote by $f_W(w)$ the density of $W$. Define the linear operator $K : L_2[0,1] \to L_2[0,1]$ by

$$(Kg)(w) = \int f_{X,W}(x, w) g(x)\,dx.$$

Let $h(w) = E[Y \mid W = w] f_W(w)$. Then, the conditional moment restriction (1) is equivalent to the operator equation

$$K g_0 = h. \tag{2}$$

Assume that $K$ is injective to guarantee identification of $g_0$. The function $h$ is relatively standard to estimate. However, even though $K$ is injective, its inverse $K^{-1}$ is not $L_2$-continuous, since $K$ is Hilbert-Schmidt and hence its $l$-th largest singular value, denoted by $\kappa_l$, approaches zero as $l \to \infty$. In this sense, the problem of recovering $g_0$ from $h$ is ill-posed.

A model of the form (1) is of principal importance in econometrics (see Hall and Horowitz, 2005; Horowitz, 2011). From a statistical perspective, the problem of recovering the structural function $g_0$ is challenging since it is an ill-posed inverse problem with the additional difficulty that $K$ is unknown (furthermore, it is not plausible to think that $K$ is known up to a random error independent of the data, which is a notable difference from the case considered in Hoffman and Reiss, 2008). Statistical inverse problems, including the current problem, have attracted considerable interest in statistics, econometrics and mathematical analysis. We refer the reader to Kress (1999) for a textbook treatment of linear inverse problems, and Cavalier (2008) for a recent review on statistical inverse problems.

Approaches to estimating the structural function $g_0$ are roughly classified into two types: the method involving the Tikhonov regularization (Hall and Horowitz, 2005; Darolles et al., 2011) and the sieve-based method (Newey and Powell, 2003; Ai and Chen, 2003; Blundell et al., 2007; Horowitz, 2012).$^1$ The minimax optimal rates of convergence in estimating the structural function $g_0$ are established in Hall and Horowitz (2005) and Chen and Reiss (2011). Similarly to other statistical inverse problems, these rates are characterized by the smoothness of $g_0$ and the "ill-posedness" of the problem. The optimal rates are achieved by the estimators proposed by Hall and Horowitz (2005) and Blundell et al. (2007) under their respective assumptions.

All the above mentioned studies are from a purely frequentist perspective. Little is known about the theoretical properties of Bayes or quasi-Bayes analysis of the NPIV model. Exceptions are Florens and Simoni (2012a) and Liao and Jiang (2011), with which we will compare our work in the subsection below.

$^1$ The sieve-based method is approximately the Galerkin projection method in mathematical analysis.

This paper aims at developing a quasi-Bayesian analysis of the NPIV model, with a focus on the asymptotic properties of quasi-posterior distributions. The approach taken is quasi-Bayes in the sense that no specific distribution of $(Y, X, W)$ is assumed and the analysis is based upon a quasi-likelihood induced from the conditional moment restriction. The quasi-likelihood is constructed by first estimating the conditional moment function $m(\cdot, g) = E[Y - g(X) \mid W = \cdot]$ in a nonparametric way, and taking $\exp\{-(1/2)\sum_{i=1}^n \hat{m}^2(W_i, g)\}$ as if it were a likelihood of $g$. For this quasi-likelihood, we put a prior on the function-valued parameter $g$. By doing so, formally, the posterior distribution for $g$ may be defined, which we call the "quasi-posterior distribution". This posterior corresponds to what Jiang and Tanner (2008) called the "Gibbs posterior", and admits a substantive interpretation (see Proposition 1 below).

This framework of the quasi-posterior (Gibbs posterior) affords flexibility, since a stringent distributional assumption on the data generating process, such as normality, is not required. Such a framework extends the Bayesian approach to broad fields of statistical problems, as Jiang and Tanner (2008, p. 2211) remarked: "This framework of the Gibbs posterior has been overlooked by most statisticians for a long time [...] a foundation for understanding the statistical behavior of the Gibbs posterior, which we believe will open a productive new line of research."

In this paper, we shall use sieve priors, which are prior distributions that concentrate on finite dimensional sieve spaces. The dimension of the sieve space, which plays the role of a regularization parameter, should go to infinity with the sample size. Potentially, there are several choices of sieve spaces. Here we choose to use wavelet bases to form sieve spaces. Wavelet bases are useful for treating smoothness function classes such as Hölder-Zygmund and Sobolev spaces in a unified and convenient way. Likewise, we shall use wavelet series estimation of the conditional moment function $m(\cdot, g)$.$^2$

Under this setup, we study the asymptotic properties of the quasi-posterior distribution. The results obtained are summarized as follows. First, we derive rates of contraction for the quasi-posterior distribution and establish conditions on priors under which the minimax optimal rate of contraction is attained. Here the contraction is stated in the standard $L_2$-norm. Second, we show asymptotic normality of the quasi-posterior of the first $k_n$ generalized Fourier coefficients, where $k_n \to \infty$ is the dimension of the sieve space. This may be viewed as a nonparametric Bernstein-von Mises type result (see van der Vaart, 1998, Chapter 10 for the classical Bernstein-von Mises theorem for regular parametric models). Third, we derive rates of convergence of the quasi-Bayes estimator defined by the posterior expectation and show that under some conditions it attains the minimax optimal rate of convergence. Finally, we give some specific sieve priors for which the quasi-posterior distribution (the quasi-Bayes estimator) attains the minimax optimal rate of contraction (convergence, respectively). These results greatly sharpen the previous work of e.g. Liao and Jiang (2011), as we will review below.

$^2$ This does not rule out the use of other bases such as the Fourier and Hermite polynomial bases. See Remark 5.

1.2. Literature review and contributions 

Closely related are Florens and Simoni (2012a) and Liao and Jiang (2011). The former paper worked on the reduced form equation $Y = E[g_0(X) \mid W] + V$ with $V = U + g_0(X) - E[g_0(X) \mid W]$ and assumed $V$ to be normally distributed. They considered a Gaussian prior on $g$, and because of that the posterior distribution is also Gaussian. They proposed to "regularize" the posterior and established frequentist rates for the "regularized" posterior mean. The present paper differs substantially from Florens and Simoni (2012a) in that (i) we do not assume normality of the "error"; (ii) roughly speaking, Florens and Simoni's method is tied to the Tikhonov regularization method, while ours is tied to the sieve-based method. Liao and Jiang (2011) developed an important unified framework for estimating conditional moment restriction models based on a quasi-Bayesian approach, and their scope is more general than ours. They analyzed NPIV models in detail in their Section 4. Their posterior construction is similar to ours, for instance in the use of sieve priors, but differs in detail. For example, Liao and Jiang (2011) transformed the conditional moment restriction into unconditional moment restrictions with an increasing number of restrictions. In contrast, we work directly on the conditional moment restriction.

Importantly and substantially, neither of these papers established sharp contraction rates for their (quasi-)posterior distributions, nor asymptotic normality results. It is unclear whether Florens and Simoni's rates are optimal, since their assumptions are substantially different from those in the past literature such as Hall and Horowitz (2005) and Chen and Reiss (2011). Liao and Jiang only established posterior consistency. Here, we focus on a simple but important model, and establish sharper asymptotic results for the quasi-posterior distribution. Notably, a wide class of (finite dimensional) sieve priors is shown to lead to the optimal contraction rate. Moreover, in Liao and Jiang (2011), a point estimator of the structural function is not formally analyzed. Hence the primary contribution of this paper is to considerably deepen the understanding of the asymptotic properties of the quasi-Bayesian procedure for the NPIV model.

The present paper deals with a quasi-Bayesian analysis of an infinite dimensional model. The literature on theoretical studies of Bayesian analysis of infinite dimensional models is large. Ghosh and Ramamoorthi (2003) is a good reference on this topic. We refer the reader to Ghosal et al. (2000); Shen and Wasserman (2001); Kleijn and van der Vaart (2006); Ghosal and van der Vaart (2007) for general results on contraction rates of posterior distributions in infinite dimensional models. Note that these results do not directly apply to our case since the "likelihood" here is nonparametrically estimated. The paper also contributes to the literature on Bayesian analysis of linear inverse problems, which we believe is still an under-developed area of research. For nonparametric Bayesian analysis of inverse problems other than NPIV models, we refer to Cox (1993); Florens and Simoni (2012b); Knapik et al. (2011).

Our asymptotic normality result builds upon the previous work on asymptotic normality of (quasi-)posterior distributions for models with an increasing number of parameters (Ghosal, 1999, 2000; Belloni and Chernozhukov, 2009a,b; Boucheron and Gassiat, 2009; Clarke and Ghosal, 2010; Bontemps, 2011). Closely related is Bontemps (2011), in which the author established Bernstein-von Mises theorems for Gaussian regression models with an increasing number of regressors and improved upon the earlier work of Ghosal (1999) in several respects. Bontemps (2011) covered nonparametric models by taking into account modeling bias in the analysis. However, none of these papers covered the NPIV model, nor more generally linear inverse problems.

1.3. Organization and notation 

The remainder of the paper is organized as follows. Section 2 gives an informal discussion of the quasi-Bayesian analysis of the NPIV model. Section 3 summarizes some basic facts on wavelet theory and introduces the posterior construction used in the analysis. Section 4 contains the main results of the paper. Section 5 analyzes some specific sieve priors. Section 6 contains the proofs of the main results. Section 7 concludes with some further discussion. The Appendix contains some technical results omitted from the main body.

Notation: For any given (random or non-random, scalar or vector) sequence $\{z_i\}_{i=1}^n$, we use the notation

$$E_n[z_i] = n^{-1}\sum_{i=1}^n z_i,$$

which should be distinguished from the population expectation $E[\cdot]$. For any vector $z$, let $z^{\otimes 2} = zz^T$ where $z^T$ is the transpose of $z$. For any two sequences of positive constants $r_n$ and $s_n$, we write $r_n \lesssim s_n$ if the ratio $r_n/s_n$ is bounded, and $r_n \sim s_n$ if $r_n \lesssim s_n$ and $s_n \lesssim r_n$. Let $L_2[0,1]$ denote the usual $L_2$ space with respect to the Lebesgue measure for functions defined on $[0,1]$. Let $\|\cdot\|$ denote the $L_2$-norm, i.e., $\|f\|^2 = \int_0^1 f^2(x)\,dx$. The inner product in $L_2[0,1]$ is denoted by $\langle\cdot,\cdot\rangle$, i.e., $\langle f, g\rangle = \int_0^1 f(x)g(x)\,dx$. Let $C[0,1]$ denote the metric space of all continuous functions on $[0,1]$, equipped with the uniform metric. For any function $f : [0,1] \to \mathbb{R}$, let $\|f\|_\infty = \sup_{x\in[0,1]}|f(x)|$. The Euclidean norm is denoted by $\|\cdot\|_{\ell^2}$. For any matrix $A$, let $s_{\min}(A)$ and $s_{\max}(A)$ denote the minimum and maximum singular values of $A$, respectively. Let $\|A\|_{op}$ denote the operator norm of the matrix $A$ (i.e., $\|A\|_{op} = s_{\max}(A)$). Denote by $dN(\mu, \Sigma)(x)$ the density of the multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$.
2. Quasi-Bayesian analysis: informal discussion

In this section, we outline a quasi-Bayesian analysis of the NPIV model (1). The discussion here is informal. The formal discussion is given in Section 4.

Let $\mathcal{G}$ be a parameter space (say, some smoothness class of functions, such as a Hölder-Zygmund or Sobolev space), for which we assume $g_0 \in \mathcal{G}$. We assume that $\mathcal{G}$ is at least contained in $C[0,1]$: $\mathcal{G} \subset C[0,1]$. Define the conditional moment function as $m(W, g) = E[Y - g(X) \mid W]$, $g \in \mathcal{G}$. Then $g_0$ satisfies the conditional moment restriction

$$m(W, g_0) = 0, \ \text{a.s.} \tag{3}$$

Equivalently, we have $E[m^2(W, g_0)] = 0$.

In this paper, no specific distribution of $(Y, X, W)$ is assumed. So a Bayesian analysis in the standard sense is not applicable here, since a proper likelihood for $g$ ($g$ is a generic version of $g_0$) is not available. Instead, we use a quasi-likelihood induced from the conditional moment restriction (3).

Let $(Y_1, X_1, W_1), \dots, (Y_n, X_n, W_n)$ be i.i.d. observations of $(Y, X, W)$. Let $W^n = \{W_1, \dots, W_n\}$ and $\mathcal{D}_n = \{(Y_1, X_1, W_1), \dots, (Y_n, X_n, W_n)\}$. By (3), a plausible candidate for the quasi-likelihood would be

$$p_g(W^n) = \exp\{-(n/2)E_n[m^2(W_i, g)]\},$$

since $p_g(W^n)$ is maximized at the true structural function $g_0$. Here, recall that $E_n[z_i] = n^{-1}\sum_{i=1}^n z_i$ for any sequence $\{z_i\}_{i=1}^n$. However, this $p_g(W^n)$ is infeasible since $m(\cdot, g)$ is unknown. Instead of using $p_g(W^n)$, we replace $m(\cdot, g)$ by a suitable estimate $\hat{m}(\cdot, g)$ and use the quasi-likelihood of the form

$$p_g(\mathcal{D}_n) = \exp\{-(n/2)E_n[\hat{m}^2(W_i, g)]\}.$$

Below we use a wavelet series estimator of $m(\cdot, g)$.

The quasi-Bayesian analysis considered here uses this quasi-likelihood as if it were a proper likelihood and puts priors on $g \in \mathcal{G}$. In this paper, as in Liao and Jiang (2011), we shall use sieve priors. The basic idea is to construct a sequence of finite dimensional sieve spaces (say, $\mathcal{G}_n$) that approximates the parameter space $\mathcal{G}$ well (i.e., each function in $\mathcal{G}$ is well approximated by some function in $\mathcal{G}_n$ as $n$ becomes large), and to put priors concentrating on these sieve spaces. Each sieve space is a subset of a linear space spanned by some basis functions. Hence the problem reduces to putting priors on the coefficients on those basis functions. Such priors are typically called "(finite dimensional) sieve priors" (or "series priors") and have been widely used in nonparametric Bayesian and quasi-Bayesian analysis (see e.g. Ghosal et al., 2000; Scricciolo, 2006; Ghosal and van der Vaart, 2007).

Let $\Pi_n$ be a so-constructed prior on $g \in \mathcal{G}$. Formally, the posterior distribution of $g$ given $\mathcal{D}_n$ may be defined by

$$\Pi_n(dg \mid \mathcal{D}_n) = \frac{p_g(\mathcal{D}_n)\,\Pi_n(dg)}{\int p_g(\mathcal{D}_n)\,\Pi_n(dg)}, \tag{4}$$
which we call the "quasi-posterior distribution". The quasi-posterior distribution is not a proper posterior distribution in the strict Bayesian sense since $p_g(\mathcal{D}_n)$ is not a proper likelihood. Nevertheless, $\Pi_n(dg \mid \mathcal{D}_n)$ is a proper distribution, i.e., $\int \Pi_n(dg \mid \mathcal{D}_n) = 1$. Similarly to proper posterior distributions, contraction of the quasi-posterior distribution around $g_0$ intuitively means that it contains more and more accurate information about the true structural function $g_0$ as the sample size increases. Hence, as for proper posterior distributions, it is of fundamental importance to study rates of contraction of quasi-posterior distributions. Here we say that the quasi-posterior $\Pi_n(dg \mid \mathcal{D}_n)$ contracts around $g_0$ at rate $\epsilon_n \to 0$ if $\Pi_n(g : \|g - g_0\| > \epsilon_n \mid \mathcal{D}_n) \stackrel{P}{\to} 0$.

This quasi-posterior corresponds to what Zhang (2006b) called the "Gibbs algorithm" and what Jiang and Tanner (2008) called the "Gibbs posterior". Here an interesting interpretation of the quasi-posterior is obtained.

Proposition 1. Let $\eta > 0$ be a fixed constant. Let $\Pi$ be a prior distribution for $g$ defined on, say, the Borel $\sigma$-field of $C[0,1]$. Suppose that the data $\mathcal{D}_n$ are fixed and the maps $g \mapsto \hat{m}(W_i, g)$ are measurable with respect to the Borel $\sigma$-field of $C[0,1]$. Then, the distribution

$$\hat{\Pi}_\eta(dg) = \frac{\exp\{-\eta\sum_{i=1}^n \hat{m}^2(W_i, g)\}\,\Pi(dg)}{\int \exp\{-\eta\sum_{i=1}^n \hat{m}^2(W_i, g)\}\,\Pi(dg)}$$

minimizes the empirical information complexity defined by

$$\int \sum_{i=1}^n \hat{m}^2(W_i, g)\,\tilde{\Pi}(dg) + \eta^{-1} D_{KL}(\tilde{\Pi} \,\|\, \Pi) \tag{5}$$

over all distributions $\tilde{\Pi}$ absolutely continuous with respect to $\Pi$. Here

$$D_{KL}(\tilde{\Pi} \,\|\, \Pi) = \int \pi \log \pi \,\Pi(dg), \quad \text{with } d\tilde{\Pi}/d\Pi = \pi,$$

is the Kullback-Leibler divergence from $\tilde{\Pi}$ to $\Pi$.

Proof. Immediate from Zhang (2006a, Proposition 5.1). □

The proposition shows that, given the data $\mathcal{D}_n$ and a prior $\Pi = \Pi_n$ on $g$, the quasi-posterior $\Pi_n(dg \mid \mathcal{D}_n)$ defined in (4) is obtained as a minimizer of the empirical information complexity defined by (5) with $\eta = 1/2$. This gives a rationale for using $\Pi_n(dg \mid \mathcal{D}_n)$ as a quasi-posterior since, among all possible "quasi-posteriors", this $\Pi_n(dg \mid \mathcal{D}_n)$ optimally balances the average of the natural loss function $g \mapsto \sum_{i=1}^n \hat{m}^2(W_i, g)$ and its complexity (or deviation) relative to the initial prior distribution measured by the Kullback-Leibler divergence. The scaling constant ("temperature") $\eta$ is taken to be $1/2$ here. However, changing this value does not substantially affect the asymptotic analysis.
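The variational characterization in Proposition 1 can be checked numerically in a toy setting. The following is a minimal sketch, assuming a finite grid of candidate functions so that priors and quasi-posteriors are probability vectors; the loss values, the uniform prior and the helper names (`gibbs_posterior`, `information_complexity`) are hypothetical and serve only as an illustration. It compares the information complexity of the Gibbs weights with that of randomly drawn alternative distributions.

```python
import numpy as np

def gibbs_posterior(loss, prior, eta):
    """Weights proportional to exp(-eta * loss) * prior on a finite grid."""
    log_w = -eta * loss + np.log(prior)
    log_w -= log_w.max()                      # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

def information_complexity(q, loss, prior, eta):
    """Average loss under q plus KL(q || prior) / eta, as in (5)."""
    mask = q > 0
    kl = np.sum(q[mask] * np.log(q[mask] / prior[mask]))
    return float(np.dot(q, loss) + kl / eta)

rng = np.random.default_rng(0)
n_candidates = 6
# Hypothetical loss values standing in for sum_i m_hat^2(W_i, g) at each candidate g.
loss = rng.uniform(0.0, 5.0, size=n_candidates)
prior = np.full(n_candidates, 1.0 / n_candidates)
eta = 0.5

gibbs = gibbs_posterior(loss, prior, eta)
best = information_complexity(gibbs, loss, prior, eta)

# Compare against random alternative distributions on the same grid.
alternatives = rng.dirichlet(np.ones(n_candidates), size=1000)
gaps = [information_complexity(q, loss, prior, eta) - best for q in alternatives]
print("smallest gap over alternatives:", min(gaps))
```

In this finite setting the gap equals $\eta^{-1}$ times the Kullback-Leibler divergence from the alternative to the Gibbs weights, which is why it is never negative.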

The quasi-posterior distribution provides point estimators of $g_0$. A most natural estimator would be the estimator defined by the posterior expectation (the expectation of the quasi-posterior distribution), i.e.,

$$\hat{g}_{QB} = \begin{cases} \int g\,\Pi_n(dg \mid \mathcal{D}_n), & \text{if the right integral exists}, \\ 0, & \text{otherwise}, \end{cases} \tag{6}$$

where the integral $\int g\,\Pi_n(dg \mid \mathcal{D}_n)$ is understood pointwise.

In Section 4, we will study the asymptotic properties of the quasi-posterior distribution and the quasi-Bayes estimator. In doing so, we have to specify certain regularity properties, such as the smoothness of $g_0$ and the degree of ill-posedness of the problem. How to characterize the "smoothness" of $g_0$ is important here since it is related to how priors are chosen. For that purpose, we find wavelet theory useful, and use sieve spaces constructed from wavelet bases.

3. Wavelets, function spaces and posterior construction 
3.1. Wavelet bases for $L_2[0,1]$

We review wavelet theory on the compact interval $[0,1]$. We refer the reader to Härdle et al. (1998), Mallat (2009) and Johnstone (2011, Chapter 7 and Appendix B) as useful general references on wavelet theory in the statistical (and signal processing) context.

Let $(\varphi, \psi)$ be a Daubechies pair of the scaling function and wavelet of a multiresolution analysis of the space $L_2(\mathbb{R})$ of order $N$, with $\psi$ having $N$ vanishing moments and support contained in $[-N+1, N]$, and $\varphi$ having support contained in $[0, 2N-1]$ (see Härdle et al., 1998, Remark 7.1). We translate $\varphi$ so that its support is contained in $[-N+1, N]$. Define

$$\varphi_{jk}(x) = 2^{j/2}\varphi(2^j x - k), \quad \psi_{jk}(x) = 2^{j/2}\psi(2^j x - k).$$

Then, for any fixed $J_0 \ge 0$, it is known that $\{\varphi_{J_0 k}, \psi_{jk}, j \ge J_0, k \in \mathbb{Z}\}$ forms an orthonormal basis for $L_2(\mathbb{R})$. However, we need an orthonormal basis for $L_2[0,1]$, which we construct from the Daubechies pair $(\varphi, \psi)$. The construction here is based on Cohen et al. (1993, Section 4). See also Chapter 7.5 of Mallat (2009) for wavelet bases on $[0,1]$.

Take a fixed resolution level $j$ such that $2^j \ge 2N$. For $k = N, \dots, 2^j - N - 1$, the functions $\varphi_{jk}$ are supported in $[0,1]$ and left unchanged: $\varphi^{int}_{jk}(x) = \varphi_{jk}(x)$ for $x \in [0,1]$. At the boundaries, for $k = 0, \dots, N-1$, construct some functions $\varphi^L_k$ with support $[0, N+k]$ and $\varphi^R_k$ with support $[-N-k, 0]$, and define

$$\varphi^{int}_{jk}(x) = 2^{j/2}\varphi^L_k(2^j x), \quad \varphi^{int}_{j,2^j-k-1}(x) = 2^{j/2}\varphi^R_k(2^j(x-1)), \quad x \in [0,1].$$

Let $V_j = \mathrm{span}\{\varphi^{int}_{jk}, k = 0, \dots, 2^j - 1\}$. The multiresolution spaces $V_j$ have the properties (i) $\dim(V_j) = 2^j$; (ii) $V_j \subset V_{j+1}$; (iii) each $V_j$ contains all polynomials of order at most $N-1$.
Turning to the wavelet spaces, define $W_j$ as the orthogonal complement of $V_j$ in $V_{j+1}$. Starting from the Daubechies wavelet $\psi$, construct $\psi^{int}_{jk}$ similarly to $\varphi^{int}_{jk}$. Then, we have $W_j = \mathrm{span}\{\psi^{int}_{jk}, k = 0, \dots, 2^j - 1\}$, and for any $J_0 \ge 1$ with $2^{J_0} \ge 2N$ and $J > J_0$,

$$V_J = V_{J_0} \oplus \bigoplus_{J_0 \le j \le J-1} W_j, \quad L_2[0,1] = V_{J_0} \oplus \bigoplus_{j \ge J_0} W_j.$$

Therefore, $\{\varphi^{int}_{J_0,k}\}_{k=0}^{2^{J_0}-1} \cup \{\psi^{int}_{jk}, j \ge J_0, k = 0, \dots, 2^j - 1\}$ forms an orthonormal basis for $L_2[0,1]$ (see Section 4 of Cohen et al., 1993, for formal proofs of these results).

To make the notation simpler, define functions $\phi_1, \phi_2, \dots$ by

$$\phi_1 = \varphi^{int}_{J_0,0}, \ \phi_2 = \varphi^{int}_{J_0,1}, \ \dots, \ \phi_{2^{J_0}} = \varphi^{int}_{J_0,2^{J_0}-1},$$
$$\phi_{2^{J_0}+1} = \psi^{int}_{J_0,0}, \ \phi_{2^{J_0}+2} = \psi^{int}_{J_0,1}, \ \dots, \ \phi_{2^{J_0+1}} = \psi^{int}_{J_0,2^{J_0}-1},$$
$$\phi_{2^{J_0+1}+1} = \psi^{int}_{J_0+1,0}, \ \phi_{2^{J_0+1}+2} = \psi^{int}_{J_0+1,1}, \ \dots, \ \phi_{2^{J_0+2}} = \psi^{int}_{J_0+1,2^{J_0+1}-1},$$
$$\vdots$$
$$\phi_{2^J+1} = \psi^{int}_{J,0}, \ \phi_{2^J+2} = \psi^{int}_{J,1}, \ \dots, \ \phi_{2^{J+1}} = \psi^{int}_{J,2^J-1},$$

so that we have

$$\{\varphi^{int}_{J_0,k}\}_{k=0}^{2^{J_0}-1} \cup \{\psi^{int}_{jk}, j \ge J_0, k = 0, \dots, 2^j - 1\} = \{\phi_l, l \ge 1\}.$$

Note that

$$V_J = \mathrm{span}\{\phi_1, \dots, \phi_{2^J}\}, \quad J \ge J_0.$$

Definition 1. Call the so-constructed basis $\{\phi_l, l \ge 1\}$ the CDV (Cohen-Daubechies-Vial) wavelet basis for $L_2[0,1]$ generated from the Daubechies pair $(\varphi, \psi)$. If $(\varphi, \psi)$ is $S$-regular, i.e., if $\varphi$ and $\psi$ are $S$-times continuously differentiable, then call the so-generated CDV wavelet basis $\{\phi_l, l \ge 1\}$ $S$-regular.

Remark 1. For any given positive integer $S$, there is an $S$-regular Daubechies pair $(\varphi, \psi)$, obtained by taking the order $N$ sufficiently large (see Härdle et al., 1998, Remark 7.1).

Finally, denote by $P_J$ the projection operator from $L_2[0,1]$ onto the $J$-th multiresolution space $V_J$, i.e., for any $g = \sum_{l=1}^\infty b_l\phi_l \in L_2[0,1]$, $P_J g = \sum_{l=1}^{2^J} b_l\phi_l$.

In what follows, for any $J \in \mathbb{N}$, notation of the kind $b^J$ means that it is a vector of dimension $2^J$. For example, $b^J = (b_1, \dots, b_{2^J})^T$.

3.2. Function spaces 

We recall the definition of Besov spaces. 
Definition 2. Let $0 < s < S$, $s \in \mathbb{R}$, $S \in \mathbb{N}$ and $1 \le p, q \le \infty$. Let $\{\phi_l, l \ge 1\}$ be an $S$-regular CDV wavelet basis for $L_2[0,1]$. Denote by $b_l(f) = \int f\phi_l$ the generalized Fourier coefficients of $f \in L_2[0,1]$. Then the Besov space $B^s_{p,q}$ is defined by the set of functions $\{f \in L_2[0,1] : \|f\|_{s,p,q} < \infty\}$, where

$$\|f\|_{s,p,q} = \Big(\sum_{k=1}^{2^{J_0}} |b_k(f)|^p\Big)^{1/p} + \Big(\sum_{j=J_0}^{\infty}\Big(2^{j(s+1/2-1/p)}\Big(\sum_{k=2^j+1}^{2^{j+1}} |b_k(f)|^p\Big)^{1/p}\Big)^q\Big)^{1/q},$$

with the obvious modification in case $p = \infty$ or $q = \infty$.

Remark 2. Besov spaces cover commonly used smooth function spaces. For example, $B^s_{\infty,\infty}$ is equal to the Hölder-Zygmund space, which coincides with the classical Hölder space for non-integer $s$. For integer $s$, they do not coincide but the Hölder-Zygmund space contains the classical Hölder space. Furthermore, $B^s_{2,2}$ is equal to the classical $L_2$-Sobolev space when $s$ is an integer.

Remark 3 (Approximation property). For either $g \in B^s_{\infty,\infty}$ or $g \in B^s_{2,2}$, we have $\|g - P_J g\|^2 \le C 2^{-2Js}$ for all $J \ge J_0$. Here the constant $C$ depends only on $s$ and the corresponding Besov norm of $g$.

As a parameter space, we assume $\mathcal{G} = \mathcal{G}^s = B^s_{\infty,\infty}$ (Hölder-Zygmund) or $B^s_{2,2}$ (Sobolev) for some $s > 1/2$. Note that in either case $\mathcal{G}^s \subset C[0,1]$. In what follows, take and fix an $S$-regular CDV wavelet basis $\{\phi_l, l \ge 1\}$ with $S > s$. We keep this convention throughout the analysis.

3.3. Posterior construction 

To construct quasi-posterior distributions, we have to estimate $m(\cdot, g)$ and construct a sequence of sieve spaces for $\mathcal{G}^s$ on which priors concentrate. For the former purpose, we use a wavelet series estimator of $m(\cdot, g)$. For the latter purpose, we construct a sequence of sieve spaces formed by the wavelet basis. For $J \ge J_0$, define the $2^J$-dimensional vector of functions $\phi^J(w)$ by

$$\phi^J(w) = (\phi_1(w), \dots, \phi_{2^J}(w))^T.$$

Let $J_n \ge J_0$ be a sequence of positive integers such that $J_n \to \infty$ and $2^{J_n} = o(n)$. Let

$$\hat{m}(w, g) = \phi^{J_n}(w)^T \big(E_n[\phi^{J_n}(W_i)^{\otimes 2}]\big)^{-1} E_n[\phi^{J_n}(W_i)(Y_i - g(X_i))],$$

which is a wavelet series estimator of $m(\cdot, g)$ (replace the inverse matrix by the generalized inverse if the former does not exist; the probability of such an event converges to zero as $n \to \infty$ under the assumptions below). We use this wavelet series estimator throughout the analysis (see the remark at the end of the section).
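The series estimator above is an ordinary least squares projection of the residuals $Y_i - g(X_i)$ on basis functions of $W_i$. The following is a minimal sketch, assuming Haar scaling functions (indicators of dyadic intervals) as a simple stand-in for an $S$-regular CDV basis and a simulated design chosen purely for illustration; the function names and the data generating process are hypothetical and not taken from the paper.

```python
import numpy as np

def haar_scaling_basis(x, level):
    """Orthonormal Haar scaling functions on [0, 1] at the given level.

    Returns an (n, 2**level) matrix whose (i, k) entry equals 2**(level/2)
    if x[i] lies in the k-th dyadic interval and 0 otherwise.
    """
    k = 2 ** level
    x = np.asarray(x, dtype=float)
    idx = np.minimum((x * k).astype(int), k - 1)
    basis = np.zeros((x.size, k))
    basis[np.arange(x.size), idx] = np.sqrt(k)
    return basis

def m_hat(w_eval, y, x, w, g, level):
    """Series estimate of m(w, g) = E[Y - g(X) | W = w] at the points w_eval."""
    phi_w = haar_scaling_basis(w, level)
    gram = phi_w.T @ phi_w / len(w)             # E_n[phi(W_i) phi(W_i)^T]
    cross = phi_w.T @ (y - g(x)) / len(w)       # E_n[phi(W_i) (Y_i - g(X_i))]
    coef = np.linalg.pinv(gram) @ cross         # generalized inverse, as in the text
    return haar_scaling_basis(w_eval, level) @ coef

def log_quasi_likelihood(y, x, w, g, level):
    """-(n/2) E_n[m_hat^2(W_i, g)]."""
    fitted = m_hat(w, y, x, w, g, level)
    return -0.5 * len(w) * np.mean(fitted ** 2)

# Hypothetical design: X is endogenous (it shares the shock e with U), W is a valid instrument.
rng = np.random.default_rng(0)
n = 5000
w = rng.uniform(size=n)
e = rng.uniform(size=n)
x = 0.5 * w + 0.5 * e
u = (e - 0.5) + 0.1 * rng.standard_normal(n)
g0 = lambda t: t ** 2
y = g0(x) + u

ll_true = log_quasi_likelihood(y, x, w, g0, level=3)
ll_zero = log_quasi_likelihood(y, x, w, lambda t: np.zeros_like(t), level=3)
print("log quasi-likelihood at g0:", ll_true)
print("log quasi-likelihood at g = 0:", ll_zero)
```

With the Haar stand-in, $\hat{m}(\cdot, g)$ reduces to cell averages of the residuals over dyadic intervals of $W$, which makes the role of the resolution level easy to see.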

For the same $J_n$, we shall take $V_{J_n} = \mathrm{span}\{\phi_1, \dots, \phi_{2^{J_n}}\}$ as a sieve space for $\mathcal{G}^s$. We consider priors $\Pi_n$ that concentrate on $V_{J_n}$, i.e., $\Pi_n(V_{J_n}) = 1$. Formally, we think of priors on $g$ as defined on the Borel $\sigma$-field of $C[0,1]$ (hence the quasi-posterior $\Pi_n(dg \mid \mathcal{D}_n)$ is understood to be defined on the Borel $\sigma$-field of $C[0,1]$, which is possible since the map $g \mapsto p_g(\mathcal{D}_n)$ here is continuous on $C[0,1]$). Since the map $b^{J_n} = (b_1, \dots, b_{2^{J_n}})^T \mapsto \sum_{l=1}^{2^{J_n}} b_l\phi_l$, $\mathbb{R}^{2^{J_n}} \to C[0,1]$, is homeomorphic from $\mathbb{R}^{2^{J_n}}$ onto $V_{J_n}$, putting priors on $g \in V_{J_n}$ is equivalent to putting priors on $b^{J_n} \in \mathbb{R}^{2^{J_n}}$ (the latter are of course defined on the Borel $\sigma$-field of $\mathbb{R}^{2^{J_n}}$). Practically, priors on $g \in V_{J_n}$ are induced from priors on $b^{J_n} \in \mathbb{R}^{2^{J_n}}$. For later purposes, it is useful to fix the correspondence between priors for these two parametrizations. Unless otherwise stated, we follow the notational convention:

$$\tilde{\Pi}_n : \text{a prior on } b^{J_n} \in \mathbb{R}^{2^{J_n}} \ \leftrightarrow \ \Pi_n : \text{the induced prior on } g \in V_{J_n}.$$

We shall call $\tilde{\Pi}_n$ a generating prior, and $\Pi_n$ the induced prior.

Correspondingly, the quasi-posterior for $b^{J_n}$ is defined. With a slight abuse of notation, for $g = \sum_{l=1}^{2^{J_n}} b_l\phi_l$, we write $\hat{m}(w, b^{J_n}) = \hat{m}(w, g)$, and take $p_{b^{J_n}}(\mathcal{D}_n) = \exp\{-(n/2)E_n[\hat{m}^2(W_i, b^{J_n})]\}$ as a quasi-likelihood for $b^{J_n}$. Note that in this particular setting, the log quasi-likelihood is quadratic in $b^{J_n}$. Let $\tilde{\Pi}_n(db^{J_n} \mid \mathcal{D}_n)$ denote the resulting quasi-posterior distribution for $b^{J_n}$, i.e.,

$$\tilde{\Pi}_n(db^{J_n} \mid \mathcal{D}_n) = \frac{p_{b^{J_n}}(\mathcal{D}_n)\,\tilde{\Pi}_n(db^{J_n})}{\int p_{b^{J_n}}(\mathcal{D}_n)\,\tilde{\Pi}_n(db^{J_n})}. \tag{7}$$

For the quasi-Bayes estimator $\hat{g}_{QB}$ defined by (6), since for every $x \in [0,1]$, the map $g \mapsto g(x)$ is continuous on $C[0,1]$, and conditional on $\mathcal{D}_n$ the quasi-posterior $\Pi_n(dg \mid \mathcal{D}_n)$ is a Borel probability measure on $C[0,1]$, the integral $\int g(x)\,\Pi_n(dg \mid \mathcal{D}_n)$ exists as soon as $\int |g(x)|\,\Pi_n(dg \mid \mathcal{D}_n) < \infty$. Furthermore, $\hat{g}_{QB}$ can be computed by using the relation

$$\int g(x)\,\Pi_n(dg \mid \mathcal{D}_n) = \phi^{J_n}(x)^T \int b^{J_n}\,\tilde{\Pi}_n(db^{J_n} \mid \mathcal{D}_n)$$

as soon as the integral on the right side exists. Hence, practically, it is sufficient to compute the expectation of $\tilde{\Pi}_n(db^{J_n} \mid \mathcal{D}_n)$.
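Because the log quasi-likelihood is quadratic in the coefficient vector, a Gaussian generating prior yields a Gaussian quasi-posterior whose mean is available in closed form. The following is a minimal sketch under assumptions that are ours and not the paper's: an isotropic $N(0, \sigma^2 I)$ generating prior, a cosine basis standing in for the CDV wavelet basis, and a simulated design. Writing $P$ for the least squares projection onto the columns of the basis matrix evaluated at $W_i$, the log quasi-likelihood equals $-\tfrac12\|P(Y - \Phi_X b)\|^2$, where $\Phi_X$ is the basis matrix evaluated at $X_i$.

```python
import numpy as np

def cosine_basis(x, dim):
    """First `dim` functions of the orthonormal cosine basis on [0, 1]."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x)]
    cols += [np.sqrt(2.0) * np.cos(l * np.pi * x) for l in range(1, dim)]
    return np.column_stack(cols)

def gaussian_quasi_posterior(y, x, w, dim, prior_var):
    """Mean and covariance of the quasi-posterior of b under a N(0, prior_var * I) prior."""
    phi_x = cosine_basis(x, dim)
    phi_w = cosine_basis(w, dim)
    # Fitted values from regressing each column of phi_x on phi_w (i.e. P @ phi_x).
    proj_phi_x = phi_w @ np.linalg.lstsq(phi_w, phi_x, rcond=None)[0]
    precision = proj_phi_x.T @ proj_phi_x + np.eye(dim) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ (proj_phi_x.T @ y)
    return mean, cov

def log_quasi_posterior(b, y, x, w, dim, prior_var):
    """Unnormalised log quasi-posterior density of b (up to an additive constant)."""
    phi_x = cosine_basis(x, dim)
    phi_w = cosine_basis(w, dim)
    resid = y - phi_x @ b
    fitted = phi_w @ np.linalg.lstsq(phi_w, resid, rcond=None)[0]   # m_hat(W_i, b)
    return -0.5 * fitted @ fitted - 0.5 * b @ b / prior_var

# Hypothetical design with an endogenous regressor and a valid instrument.
rng = np.random.default_rng(0)
n, dim, prior_var = 5000, 4, 100.0
w = rng.uniform(size=n)
e = rng.uniform(size=n)
x = 0.5 * w + 0.5 * e
y = x ** 2 + (e - 0.5) + 0.1 * rng.standard_normal(n)

mean, cov = gaussian_quasi_posterior(y, x, w, dim, prior_var)
grid = np.linspace(0.0, 1.0, 11)
g_qb = cosine_basis(grid, dim) @ mean      # quasi-Bayes estimate on a grid
print(np.round(g_qb, 3))
```

The last two lines use the relation between the quasi-posterior expectation of the function and that of its coefficients: the estimate at a point is the basis vector at that point times the quasi-posterior mean of the coefficients.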

Remark 4. The use of the same wavelet basis to estimate $m(\cdot, g)$ and to construct a sequence of sieve spaces for $\mathcal{G}^s$ is not essential and can be relaxed. Suppose that we have another CDV wavelet basis $\{\tilde{\phi}_l\}$ for $L_2[0,1]$ and use this basis to estimate $m(\cdot, g)$. Then, all the results below apply by simply replacing $\phi_l(W_i)$ by $\tilde{\phi}_l(W_i)$. To keep the notation simple, we use the same wavelet basis. However, the use of the same resolution level $J_n$ is essential (at least at the proof level) in establishing the asymptotic properties of the quasi-posterior distribution. It may be a technical artifact, but we do not extend the theory in this direction since there is no clear theoretical benefit in doing so.

Remark 5. The use of CDV wavelet bases is not crucial and one may use other reasonable bases such as the Fourier and Hermite polynomial bases. The theory below can be extended to such bases with some modifications. However, CDV wavelet bases are particularly well suited to approximating (not necessarily periodic) smooth functions, which is the reason why we use CDV wavelet bases here. On the other hand, for example, the Fourier basis is only appropriate for approximating periodic functions, and it is often not natural to assume that the structural function $g_0$ is periodic.

4. Theoretical analysis 
4.1. Basic assumptions

We state some basic assumptions. We do not state here assumptions on priors, which will be stated in the theorems below. In what follows, let $C_1 > 1$ be some constant. First of all, we assume:

Assumption 1. (i) $(X, W)$ has joint density $f_{X,W}(x, w)$ on $[0,1]^2$ satisfying $f_{X,W}(x, w) \le C_1$, $\forall x, w \in [0,1]$. (ii) Denote by $f_W(w)$ the density of $W$, i.e., $f_W(w) = \int f_{X,W}(x, w)\,dx$. Then, $f_W(w) \ge C_1^{-1}$, $\forall w \in [0,1]$. (iii) $\sup_{w \in [0,1]} E[U^2 \mid W = w] \le C_1$.

Assumption 1 is a usual restriction in the literature, up to minor differences (see Hall and Horowitz, 2005; Horowitz, 2012). Denote by $f_X(x)$ the density of $X$, i.e., $f_X(x) = \int f_{X,W}(x, w)\,dw$. Then, we have $f_X(x) \le C_1$, $\forall x \in [0,1]$ and $f_W(w) \le C_1$, $\forall w \in [0,1]$.

For identification of $g_0$, we assume:

Assumption 2. The linear operator $K : L_2[0,1] \to L_2[0,1]$ is injective.

For smoothness of $g_0$, as mentioned before, we assume:

Assumption 3. $\exists s > 1/2$, $g_0 \in \mathcal{G}^s$, where $\mathcal{G}^s$ is either $B^s_{\infty,\infty}$ or $B^s_{2,2}$.

We refer the reader to Newey and Powell (2003) and d'Haultfoeuille (2011) for discussion of the identification issue. We should note that restricting the domain of $K$ to a "small" set, such as a Sobolev ball, would substantially relax Assumption 2, which however requires a different analysis. For the sake of simplicity, we assume the injectivity of $K$ on the full domain.

Remark 6. Liao and Jiang (2011) formally allowed for the case in which (in our notation) $K$ is not injective, i.e., $\{g : Kg = 0\} \neq \{0\}$. However, their Assumption 4.5 (i) indeed implies the injectivity of $K$ when the basis used is an orthonormal basis of $L_2[0,1]$. In the case that $K$ is not injective, their Assumption 4.5 requires us to have some a priori knowledge of the eigen-structure of $K^*K$ ($K^*$ denotes the adjoint of $K$), which is typically not available.

As discussed in the Introduction, solving (2) is an ill-posed inverse problem. Thus,
the statistical difficulty of estimating $g_0$ depends on the difficulty of "inverting"
$K$, which is usually referred to as the "ill-posedness" of the inverse problem (2).
Typically, the ill-posedness is characterized by the decay rate of $\kappa_l \to 0$ ($\kappa_l$ is
the $l$-th largest singular value of $K$), which is plausible if $K$ were known and
the singular value decomposition of $K$ were used (see Cavalier, 2008). However,
here, $K$ is unknown and the known wavelet basis $\{\phi_l\}$ is used instead of the
singular value system. Thus, it is suitable to quantify the ill-posedness using the
wavelet basis $\{\phi_l\}$. To this end, define



\[
\tau_J = s_{\min}\left(E[\phi^J(W)\phi^J(X)^T]\right) = s_{\min}\left((\langle \phi_l, K\phi_m \rangle)_{1 \leq l,m \leq 2^J}\right), \quad J \geq J_0.
\]

This quantity corresponds to what is called the "sieve measure of ill-posedness"
in the literature (Blundell et al., 2007; Horowitz, 2012). We at least have to
assume that $\tau_J > 0$ for all $J \geq J_0$. Note however that



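\[
\begin{aligned}
\tau_J &= \min_{g \in V_J, \|g\| = 1} \|(\langle \phi_l, Kg \rangle)_{1 \leq l \leq 2^J}\|_{\ell^2} \\
&\leq \min_{g \in V_J, \|g\| = 1} \|Kg\| \quad \text{(Bessel's inequality)} \\
&\leq \kappa_{2^J}, \quad \text{(Courant-Fischer-Weyl's minimax principle)}
\end{aligned}
\]

by which, necessarily, $\tau_J \to 0$ as $J \to \infty$. For this quantity, we assume: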



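Assumption 4. (i) $\exists r > 0$, $\tau_J \geq C_1^{-1}2^{-Jr}$, $\forall J \geq J_0$; (ii) $\|E[\phi^J(W)(g_0 - P_Jg_0)(X)]\|_{\ell^2}$
$(= \|(\langle \phi_l, K(g_0 - P_Jg_0) \rangle)_{l=1}^{2^J}\|_{\ell^2}) \leq C_1\tau_J\|g_0 - P_Jg_0\|$, $\forall J \geq J_0$.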



Assumption 4 (i) lower bounds $\tau_J$ as $J \to \infty$, thereby quantifying the
ill-posedness. Here we only consider mildly ill-posed cases for some technical
reasons. This rules out e.g. the case in which the joint density $f_{X,W}(x,w)$ is analytic
(see Kress, 1999, Theorem 15.20).
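To make the quantity $\tau_J$ concrete, the following is a minimal numerical sketch, not part of the paper's method. It assumes a cosine basis in place of the wavelet basis, a hypothetical joint density $3\min(w,x)$ on $[0,1]^2$ (chosen only because it is symmetric and positive semi-definite; it does not satisfy Assumption 1 (ii) at $w = 0$), and midpoint quadrature to approximate $E[\phi^J(W)\phi^J(X)^T]$. The function names are illustrative.

```python
import numpy as np

def cosine_basis(x, k):
    """First k functions of the orthonormal cosine basis of L2[0,1], evaluated at x."""
    cols = [np.ones_like(x)]
    cols += [np.sqrt(2.0) * np.cos(np.pi * m * x) for m in range(1, k)]
    return np.column_stack(cols)                      # shape (len(x), k)

def sieve_ill_posedness(density, J, n_grid=400):
    """Approximate tau_J = s_min((<phi_l, K phi_m>)_{l,m <= 2^J}) by midpoint quadrature."""
    grid = (np.arange(n_grid) + 0.5) / n_grid         # midpoints of [0,1]
    phi = cosine_basis(grid, 2 ** J)
    f = density(grid[:, None], grid[None, :])         # f[i, j] = f(w_i, x_j)
    gram = phi.T @ f @ phi / n_grid ** 2              # approximates E[phi(W) phi(X)^T]
    return np.linalg.svd(gram, compute_uv=False).min()

# Hypothetical joint density on [0,1]^2 (assumption made for illustration only).
toy_density = lambda w, x: 3.0 * np.minimum(w, x)

taus = [sieve_ill_posedness(toy_density, J) for J in range(1, 5)]
```

In this toy example the values in `taus` are positive and decrease with $J$, which illustrates the statement $\tau_J \to 0$; Assumption 4 (i) restricts how fast this decay may be.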

Assumption 4 (ii) is a "stability" condition about the bias $g_0 - P_Jg_0$, which
states that $K(g_0 - P_Jg_0)$ is sufficiently "small" relative to $g_0 - P_Jg_0$. Note that
in the (ideal) case in which $K$ is self-adjoint and $\{\phi_l\}$ is the eigen-basis of $K$,
$\langle \phi_l, K(g_0 - P_Jg_0) \rangle = 0$ for all $l = 1, \ldots, 2^J$, in which case Assumption 4 (ii) is
trivially satisfied. Assumption 4 (ii) allows more general situations in which $K$
may not be self-adjoint and $\{\phi_l\}$ may not be the eigen-basis of $K$ by allowing
for a certain "slack". This assumption, although it looks technical, is common in
the study of rates of convergence in estimation of the structural function $g_0$.
Indeed, essentially similar conditions have appeared in the past literature such
as Blundell et al. (2007); Chen and Reiss (2011); Horowitz (2012). For example,
Blundell et al. (2007, Assumption 6) essentially states (in our notation) that
$\|K(g_0 - P_Jg_0)\| \leq C_1\tau_J\|g_0 - P_Jg_0\|$, which implies our Assumption 4 (ii) since

\[
\|(\langle \phi_l, K(g_0 - P_Jg_0) \rangle)_{l=1}^{2^J}\|_{\ell^2} \leq \|K(g_0 - P_Jg_0)\| \quad \text{(Bessel's inequality)}.
\]

Remark 7. For given values of $C_1 > 1$, $M > 0$, $r > 0$ and $s > 1/2$, let
$\mathcal{F} = \mathcal{F}(C_1, M, r, s)$ denote the set of all distributions of $(Y, X, W)$ satisfying
Assumptions 1-4 with $\|g_0\|_{s,\infty,\infty} \leq M$ in case of $\mathcal{G}^s = B^s_{\infty,\infty}$ and $\|g_0\|_{s,2,2} \leq M$
in case of $\mathcal{G}^s = B^s_{2,2}$. By Hall and Horowitz (2005); Chen and Reiss (2011), it is
shown that the minimax rate of convergence (in $\|\cdot\|$) of estimation of $g_0$ over
this distribution class $\mathcal{F}$ is $n^{-s/(2r+2s+1)}$ as the sample size $n \to \infty$.

Suppose now that for some $\epsilon_n \to 0$, $\sup_{F \in \mathcal{F}} E_F[\Pi_n(g : \|g - g_0\| > \epsilon_n \mid
\mathcal{D}_n)] \to 0$. Then, by Theorem 2.5 of Ghosal et al. (2000), there exists a point
estimator that converges (in probability) at least as fast as $\epsilon_n$ uniformly in
$F \in \mathcal{F}$. Here, the quasi-posterior is not a proper posterior, but the proof of
Ghosal et al. (2000, Theorem 2.5) applies to this case. By this, in the minimax
sense, the fastest possible rate of contraction of the quasi-posterior distribution
$\Pi_n(dg \mid \mathcal{D}_n)$ is $n^{-s/(2r+2s+1)}$.



4.2. Main results

In what follows, let $(Y_1, X_1, W_1), \ldots, (Y_n, X_n, W_n)$ be i.i.d. observations of $(Y, X, W)$.
Denote by $b_0^J = (b_{01}, \ldots, b_{0,2^J})^T$ the vector of the first $2^J$ generalized Fourier
coefficients of $g_0$, i.e., $b_{0l} = \int \phi_l g_0$. Let $\|\cdot\|_{TV}$ denote the total variation norm
between two distributions.

Theorem 1. Suppose that Assumptions 1-4 are satisfied. Take $J_n$ in such a way
that $J_n \to \infty$ and $2^{J_n} = o((n/\log n)^{1/(2r+1)})$. Let $\epsilon_n$ be a sequence of positive
constants such that $\epsilon_n \to 0$ and $n\epsilon_n^2 \gtrsim 2^{J_n}$. Suppose that the generating priors $\Pi_n$
have densities $\pi_n$ on $\mathbb{R}^{2^{J_n}}$ and satisfy the following conditions:

P1) (Small ball condition) There exists a constant $C > 0$ such that for all $n$
sufficiently large, $\Pi_n(b^{J_n} : \|b^{J_n} - b_0^{J_n}\|_{\ell^2} \leq \epsilon_n) \geq e^{-Cn\epsilon_n^2}$.

P2) (Prior flatness condition) Let $\gamma_n = 2^{-J_ns} + 2^{J_nr}\epsilon_n$. There exists a sequence
of constants $L_n \to \infty$ sufficiently slowly such that for all $n$ sufficiently
large, $\pi_n(b^{J_n})$ is positive for all $\|b^{J_n} - b_0^{J_n}\|_{\ell^2} \leq L_n\gamma_n$, and

\[
\sup_{\|b^{J_n}\|_{\ell^2} \leq L_n\gamma_n,\ \|\tilde b^{J_n}\|_{\ell^2} \leq L_n\gamma_n}
\left| \frac{\pi_n(b_0^{J_n} + b^{J_n})}{\pi_n(b_0^{J_n} + \tilde b^{J_n})} - 1 \right| \to 0.
\]

Then, for every sequence $M_n \to \infty$, we have

\[
\Pi_n\left\{ b^{J_n} : \|b^{J_n} - b_0^{J_n}\|_{\ell^2} > M_n\left(2^{-J_ns} + 2^{J_nr}\sqrt{2^{J_n}/n}\right) \,\Big|\, \mathcal{D}_n \right\} \stackrel{P}{\to} 0. \tag{8}
\]

Furthermore, assume that $2^{J_n} = o((n/\log n)^{1/(2r+3)})$. Then, we have

\[
\left\| \Pi_n\{b^{J_n} : \sqrt{n}(b^{J_n} - b_0^{J_n}) \in \cdot \mid \mathcal{D}_n\} - N(\Delta_n, \Phi_{WX}^{-1}\Phi_{WW}\Phi_{XW}^{-1})(\cdot) \right\|_{TV} \stackrel{P}{\to} 0. \tag{9}
\]

Here, $\Delta_n = \Phi_{WX}^{-1}\sqrt{n}E_n[\phi^{J_n}(W_i)R_i]$, $R_i = U_i + (g_0(X_i) - P_{J_n}g_0(X_i))$, $U_i = Y_i -
g_0(X_i)$, $\Phi_{WX} = E[\phi^{J_n}(W)\phi^{J_n}(X)^T]$, $\Phi_{XW} = \Phi_{WX}^T$, and $\Phi_{WW} = E[\phi^{J_n}(W)^{\otimes 2}]$.

Proof. See Section 6. □

First of all, since for $g = \sum_{l=1}^{2^{J_n}} b_l\phi_l$, $\|g - g_0\|^2 = \|g - P_{J_n}g_0\|^2 + \|g_0 - P_{J_n}g_0\|^2 \lesssim
\|b^{J_n} - b_0^{J_n}\|_{\ell^2}^2 + 2^{-2J_ns}$, part (8) of Theorem 1 implies that for every sequence
$M_n \to \infty$,

\[
\Pi_n\left\{ g : \|g - g_0\| > M_n\left(2^{-J_ns} + 2^{J_nr}\sqrt{2^{J_n}/n}\right) \,\Big|\, \mathcal{D}_n \right\} \stackrel{P}{\to} 0,
\]

which means that the rate of contraction of the quasi-posterior distribution
$\Pi_n(dg \mid \mathcal{D}_n)$ is $\max\{2^{-J_ns}, 2^{J_nr}\sqrt{2^{J_n}/n}\}$.$^3$ In many examples, for given $J_n \to \infty$
with $2^{J_n} = o((n/\log n)^{1/(2r+1)})$, condition P1) is satisfied with $\epsilon_n \sim \sqrt{2^{J_n}(\log n)/n}$.
Taking $J_n$ in such a way that $2^{J_n} \sim n^{1/(2r+2s+1)}$, which leads to the optimal
contraction rate, $\gamma_n$ in condition P2) is $\sim n^{-s/(2r+2s+1)}(\log n)^{1/2}$. So condition
P2) states that, to attain the optimal contraction rate, the prior density $\pi_n$
should be sufficiently "flat" in a ball with center $b_0^{J_n}$ and radius of order
(essentially) $n^{-s/(2r+2s+1)}$. Some specific priors leading to the optimal contraction
rate will be given in Section 5.
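The balance between the two terms in the contraction rate can be checked numerically. The sketch below is only an illustration under assumed values of $n$, $r$ and $s$ (they are not taken from the paper): it searches over integer resolution levels for the one minimizing $\max\{2^{-Js}, 2^{Jr}\sqrt{2^J/n}\}$ and compares the resulting dimension $2^J$ with $n^{1/(2r+2s+1)}$. The function names are hypothetical.

```python
import math

def contraction_bound(J, n, r, s):
    """max{2^{-J s}, 2^{J r} sqrt(2^J / n)}: approximation error versus inflated noise level."""
    bias = 2.0 ** (-J * s)
    noise = 2.0 ** (J * r) * math.sqrt(2.0 ** J / n)
    return max(bias, noise)

def best_resolution(n, r, s, J_max=30):
    """Integer resolution level minimizing the bound above (brute-force search)."""
    return min(range(J_max + 1), key=lambda J: contraction_bound(J, n, r, s))

n, r, s = 10 ** 6, 1.0, 2.0                      # illustrative values only
J_star = best_resolution(n, r, s)
target_dim = n ** (1.0 / (2 * r + 2 * s + 1))    # the choice 2^{J_n} ~ n^{1/(2r+2s+1)}
optimal_rate = n ** (-s / (2 * r + 2 * s + 1))   # n^{-s/(2r+2s+1)}
```

Because the first term decreases and the second increases in $J$, the minimizing integer level lies within a factor of two of `target_dim`, and the bound at `J_star` is of the same order as `optimal_rate`.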

As noted before, in many examples, for given $J_n \to \infty$ with $2^{J_n} = o((n/\log n)^{1/(2r+1)})$,
condition P1) is satisfied with $\epsilon_n \sim \sqrt{2^{J_n}(\log n)/n}$. Inspection of the proof shows
that, without condition P2), this already leads to the contraction rate $\max\{2^{-J_ns}, 2^{J_nr}\sqrt{2^{J_n}(\log n)/n}\}$,
which reduces to $(n/\log n)^{-s/(2r+2s+1)}$ by taking $2^{J_n} \sim (n/\log n)^{1/(2r+2s+1)}$.
However, this rate is not fully satisfactory because of the appearance of the log
term. Condition P2) is used to get rid of the log term.

Under a further integrability condition about $U$, $M_n \to \infty$ in (8) can be
replaced by a large fixed constant $M$.

Theorem 2. Suppose that all the conditions that guarantee (8) in Theorem 1
are satisfied. Furthermore, assume that $\sup_{w \in [0,1]} E[U^2 1(|U| > \lambda) \mid W = w] \to 0$
as $\lambda \to \infty$. Then, there exists a constant $M > 0$ such that

\[
\Pi_n\left\{ b^{J_n} : \|b^{J_n} - b_0^{J_n}\|_{\ell^2} > M\left(2^{-J_ns} + 2^{J_nr}\sqrt{2^{J_n}/n}\right) \,\Big|\, \mathcal{D}_n \right\} \stackrel{P}{\to} 0. \tag{10}
\]

Proof. See Appendix B. □

The proof consists in establishing a concentration property of the random
variable $\|E_n[\phi^{J_n}(W_i)U_i]\|_{\ell^2}$, which uses a truncation argument and Talagrand's
(1996) concentration inequality. A sufficient condition that guarantees that
$\sup_{w \in [0,1]} E[U^2 1(|U| > \lambda) \mid W = w] \to 0$ as $\lambda \to \infty$ is that $\exists \epsilon > 0$, $\sup_{w \in [0,1]} E[|U|^{2+\epsilon} \mid
W = w] < \infty$.

$^3$ We have ignored the appearance of $M_n \to \infty$, which can be arbitrarily slow. A version in
which $M_n$ is replaced by a large fixed constant $M > 0$ is presented in Theorem 2.

The second part of Theorem 1 states a Bernstein-von Mises type result for
the quasi-posterior distribution $\Pi_n(db^{J_n} \mid \mathcal{D}_n)$. A difference from the standard
Bernstein-von Mises theorem is that the covariance matrix of the centering
variable $\Phi_{WX}^{-1}\sqrt{n}E_n[\phi^{J_n}(W_i)U_i]$ (without the bias part) is $\Phi_{WX}^{-1}E[\sigma_0^2(W)\phi^{J_n}(W)^{\otimes 2}]\Phi_{XW}^{-1}$
with $\sigma_0^2(W) = E[U^2 \mid W]$ and different from $\Phi_{WX}^{-1}\Phi_{WW}\Phi_{XW}^{-1}$ (which is the
reason why we added "type"). This is a generic nature of quasi-posterior
distributions. Even for finite dimensional models, generally, the covariance matrix
of the centering variable does not coincide with that of the normal distribution
approximating the quasi-posterior distribution (see Chernozhukov and Hong,
2003).
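The mismatch between the two covariance matrices can be seen in a small simulation. The following sketch is illustrative only: it uses sample analogues of $\Phi_{WX}$ and $\Phi_{WW}$, a cosine basis rather than the wavelet basis, a hypothetical design for $(X, W)$, and two assumed conditional variance functions. It shows that the two matrices agree when $\sigma_0^2 \equiv 1$ and differ otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 5000, 4                                   # sample size and sieve dimension (illustrative)
W = rng.uniform(size=n)
X = 0.8 * W + 0.2 * rng.uniform(size=n)          # hypothetical design with X, W in [0,1]

def basis(x, k):
    """Orthonormal cosine basis of L2[0,1] (assumption: used here instead of wavelets)."""
    cols = [np.ones_like(x)] + [np.sqrt(2.0) * np.cos(np.pi * m * x) for m in range(1, k)]
    return np.column_stack(cols)

PW, PX = basis(W, k), basis(X, k)
Phi_WX = PW.T @ PX / n                           # sample analogue of E[phi(W) phi(X)^T]
Phi_WW = PW.T @ PW / n                           # sample analogue of E[phi(W) phi(W)^T]
inv_WX = np.linalg.inv(Phi_WX)

def centering_cov(sigma2):
    """Phi_WX^{-1} E[sigma0^2(W) phi(W) phi(W)^T] Phi_XW^{-1}, with sample averages."""
    middle = (PW * sigma2(W)[:, None]).T @ PW / n
    return inv_WX @ middle @ inv_WX.T

posterior_cov = inv_WX @ Phi_WW @ inv_WX.T       # covariance appearing in (9)
cov_homo = centering_cov(lambda w: np.ones_like(w))   # sigma0^2 = 1
cov_hetero = centering_cov(lambda w: 1.0 + w)         # sigma0^2(w) = 1 + w
```

Under these assumptions `posterior_cov` coincides with `cov_homo` but not with `cov_hetero`, in line with the remark that the result is only of Bernstein-von Mises "type".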

An alternative expression of (9) is stated as follows. Let $\hat b^{J_n}$ denote a
"maximum quasi-likelihood estimator" of $b_0^{J_n}$, i.e.,

\[
\hat b^{J_n} \in \arg\max_{b^{J_n}} p_{b^{J_n}}(\mathcal{D}_n).
\]

Here, note that $\hat g(\cdot) := \phi^{J_n}(\cdot)^T\hat b^{J_n}$ is a maximum quasi-likelihood estimator of
$g_0$ over $V_{J_n}$, i.e., $\hat g \in \arg\max_{g \in V_{J_n}} p_g(\mathcal{D}_n)$, and essentially the same as the sieve
minimum distance estimator of Blundell et al. (2007). Under the assumptions
of Theorem 1, with probability approaching one, $\hat b^{J_n} = \hat\Phi_{WX}^{-1}E_n[\phi^{J_n}(W_i)Y_i] =
b_0^{J_n} + \hat\Phi_{WX}^{-1}E_n[\phi^{J_n}(W_i)R_i]$. Given the proof of Theorem 1, it is not hard to see
that

\[
\left\| \Pi_n(\cdot \mid \mathcal{D}_n) - N(\hat b^{J_n}, n^{-1}\Phi_{WX}^{-1}\Phi_{WW}\Phi_{XW}^{-1})(\cdot) \right\|_{TV} \stackrel{P}{\to} 0,
\]

which is perhaps a more interpretable form of the asymptotic normality of the
quasi-posterior distribution $\Pi_n(db^{J_n} \mid \mathcal{D}_n)$.
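As a rough illustration of this centering, the sketch below computes $\hat b^{J_n} = \hat\Phi_{WX}^{-1}E_n[\phi^{J_n}(W_i)Y_i]$ together with a plug-in version of the approximating covariance $n^{-1}\Phi_{WX}^{-1}\Phi_{WW}\Phi_{XW}^{-1}$ on simulated data. It is a sketch under stated assumptions: a cosine basis replaces the wavelet basis, sample moments replace population moments, and the data-generating design and function names are hypothetical.

```python
import numpy as np

def cosine_basis(x, k):
    """Orthonormal cosine basis of L2[0,1] (assumption: stands in for the wavelet basis)."""
    cols = [np.ones_like(x)] + [np.sqrt(2.0) * np.cos(np.pi * m * x) for m in range(1, k)]
    return np.column_stack(cols)

def sieve_iv_coefficients(Y, X, W, k):
    """Return b_hat = Phi_WX^{-1} E_n[phi(W_i) Y_i] and the plug-in covariance
    n^{-1} Phi_WX^{-1} Phi_WW Phi_XW^{-1}, all with sample moments."""
    n = len(Y)
    PW, PX = cosine_basis(W, k), cosine_basis(X, k)
    Phi_WX = PW.T @ PX / n
    Phi_WW = PW.T @ PW / n
    b_hat = np.linalg.solve(Phi_WX, PW.T @ Y / n)
    inv_WX = np.linalg.inv(Phi_WX)
    cov = inv_WX @ Phi_WW @ inv_WX.T / n
    return b_hat, cov

rng = np.random.default_rng(1)
n, k = 2000, 3
W = rng.uniform(size=n)
X = 0.8 * W + 0.2 * rng.uniform(size=n)          # hypothetical design
b0 = np.array([1.0, -0.5, 0.25])
Y = cosine_basis(X, k) @ b0                      # noise-free and bias-free, so R_i = 0
b_hat, cov = sieve_iv_coefficients(Y, X, W, k)
```

In this noise-free example $R_i = 0$, so the identity $\hat b^{J_n} = b_0^{J_n} + \hat\Phi_{WX}^{-1}E_n[\phi^{J_n}(W_i)R_i]$ forces `b_hat` to reproduce `b0` up to rounding error.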

Finally, we consider the convergence rate of the quasi-Bayes estimator $\hat g_{QB}$
of $g_0$ defined by (6).

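Theorem 3. Suppose that all the conditions of Theorem 2 are satisfied. Let
$\hat g_{QB}$ be the quasi-Bayes estimator defined by (6). Then, $P\{\mathcal{D}_n : \int |g(x)|\Pi_n(dg \mid
\mathcal{D}_n) < \infty, \forall x \in [0,1]\} \to 1$, and there exists a constant $M > 0$ such that

\[
P\left[ \|\hat g_{QB} - g_0\| \leq M\max\left\{2^{-J_ns},\ 2^{J_nr}\sqrt{2^{J_n}/n},\ 2^{J_nr}\epsilon_n\rho_n(\log n)^{1/2}\right\} \right] \to 1, \tag{11}
\]

where

\[
\rho_n := \sup_{\|b^{J_n}\|_{\ell^2} \leq L_n\gamma_n,\ \|\tilde b^{J_n}\|_{\ell^2} \leq L_n\gamma_n}
\left| \frac{\pi_n(b_0^{J_n} + b^{J_n})}{\pi_n(b_0^{J_n} + \tilde b^{J_n})} - 1 \right|.
\]

Here $\epsilon_n$, $\gamma_n$ and $L_n$ are given in the statement of Theorem 1.

Proof. See Section C. □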



Theorem 3 is not directly deduced from Theorem 1. Indeed, $\|g - g_0\|$ may
not be bounded on the support of $\Pi_n$ since the support of $\Pi_n$ is allowed to
be unbounded in $\|\cdot\|$, and hence the argument used in Ghosal et al. (2000,
p.506-p.507) cannot be applied here (in Ghosal et al. (2000), a typical distance to
measure the goodness of a point estimator is the Hellinger distance, which is
uniformly bounded). Hence, additional work is needed to prove Theorem 3.

The convergence rate of the quasi-Bayes estimator is determined by the three
terms: $2^{-J_ns}$, $2^{J_nr}\sqrt{2^{J_n}/n}$, and $2^{J_nr}\epsilon_n\rho_n(\log n)^{1/2}$. The last term is typically
small relative to the other two terms. Indeed, as noted before, in many
examples, for given $J_n \to \infty$ with $2^{J_n} = o((n/\log n)^{1/(2r+1)})$, $\epsilon_n$ can be taken
in such a way that $\epsilon_n \sim \sqrt{2^{J_n}(\log n)/n}$. In that case $2^{J_nr}\epsilon_n\rho_n(\log n)^{1/2} \sim
2^{J_nr}\sqrt{2^{J_n}/n} \times \rho_n(\log n)$, and as long as $\rho_n \to 0$ sufficiently fast, i.e., $\rho_n =
O((\log n)^{-1})$, the convergence rate of the quasi-Bayes estimator $\hat g_{QB}$ reduces
to $\max\{2^{-J_ns}, 2^{J_nr}\sqrt{2^{J_n}/n}\}$, which further reduces to $n^{-s/(2r+2s+1)}$ if we can
take $2^{J_n} \sim n^{1/(2r+2s+1)}$. The rate $n^{-s/(2r+2s+1)}$ is minimax optimal under
the present setting (see Remark 7). Note here that by inspection of the proof,
$(\log n)^{1/2}$ in (11) indeed can be replaced by any other sequence slowly divergent
as $n \to \infty$. We do not pursue the generality in this direction.

5. Prior specification: examples 

In this section, we give some specific sieve priors for which the quasi-posterior 
distribution (the quasi-Bayes estimator) attains the minimax optimal rate of 
contraction (convergence, respectively). We consider two types of priors, namely, 
shrinking priors and non-shrinking priors. By a shrinking prior, we mean a prior
that has smaller weights on $b_l$ for larger $l$. A non-shrinking prior is a prior that
is not a shrinking prior.$^4$

5.1. Non-shrinking priors 

We first consider non-shrinking priors. 

Proposition 2. Suppose that Assumptions 1-4 are satisfied. Consider the
following two classes of prior distributions on $\mathbb{R}^{2^{J_n}}$:

(Product prior) Let $q(x)$ be a probability density function on $\mathbb{R}$ such that for
a constant $A > \sup_{l \geq 1}|b_{0l}|$: 1) $q(x)$ is positive on $[-A, A]$; 2) $\log q(x)$ is
Lipschitz continuous on $[-A, A]$, i.e., there exists a constant $L > 0$ possibly
depending on $A$ such that $|\log q(x) - \log q(y)| \leq L|x - y|$, $\forall x, y \in [-A, A]$.
Take the density of the generating prior by $\pi_n(b^{J_n}) = \prod_{l=1}^{2^{J_n}} q(b_l)$.

(Isotropic prior) Let $r(x)$ be a probability density function on $[0, \infty)$ having
all moments such that: 1) for a constant $A > \|g_0\|$, $r(x)$ is positive and
continuous on $[0, A]$; 2) for a constant $c > 0$, $\int_0^\infty x^{k-1}r(x)\,dx \leq e^{ck\log k}$
for all $k$ sufficiently large. Take the density of the generating prior by
$\pi_n(b^{J_n}) \propto r(\|b^{J_n}\|_{\ell^2})$.

Let $2^{J_n} \sim n^{1/(2r+2s+1)}$. Then, in either case, for every sequence $M_n \to \infty$,
we have $\Pi_n\{g : \|g - g_0\| > M_nn^{-s/(2s+2r+1)} \mid \mathcal{D}_n\} \stackrel{P}{\to} 0$. Furthermore, if
$\sup_{w \in [0,1]} E[U^2 1(|U| > \lambda) \mid W = w] \to 0$ as $\lambda \to \infty$, then there exists a constant
$M > 0$ such that $\Pi_n\{g : \|g - g_0\| > Mn^{-s/(2s+2r+1)} \mid \mathcal{D}_n\} \stackrel{P}{\to} 0$.

Proof. See Appendix D. □

$^4$ This terminology is only for convenience and not strictly well-defined.

Proposition 2 shows that a wide class of non-shrinking priors lead to the
optimal contraction rate. In either case of product or isotropic priors, the constant
$A$ is not necessarily known, which allows $q(x)$ and $r(x)$ to have unbounded
support. For example, in the former case, $q(x)$ may be the density of the standard
normal distribution, in which case $A$ can be taken to be arbitrarily large.
Likewise, in the latter case, $r(x)$ may be the density of an exponential distribution:
$r(x) = \lambda e^{-\lambda x}$, $x \geq 0$ for some $\lambda > 0$. In the isotropic prior case, $r(x)$ should
have all moments, i.e., $\int_0^\infty x^kr(x)\,dx < \infty$ for all $k \geq 1$, which ensures that
$\pi_n(b^{J_n}) \propto r(\|b^{J_n}\|_{\ell^2})$ is a proper distribution on $\mathbb{R}^{2^{J_n}}$ for every $n \geq 1$.
For the quasi-Bayes estimator $\hat g_{QB}$, we have:

Proposition 3. Suppose that Assumptions 1-4 are satisfied. Furthermore,
assume that $\sup_{w \in [0,1]} E[U^2 1(|U| > \lambda) \mid W = w] \to 0$ as $\lambda \to \infty$. Consider
the two classes of prior distributions on $\mathbb{R}^{2^{J_n}}$ given in Proposition 2. In the
isotropic prior case, assume further that $r(x)$ is Lipschitz continuous on $[0, A]$.
Let $2^{J_n} \sim n^{1/(2r+2s+1)}$. Then, in either case of product or isotropic priors, there
exists a constant $M > 0$ such that $P\{\|\hat g_{QB} - g_0\| > Mn^{-s/(2r+2s+1)}\} \to 0$.

Proof. See Appendix D. □



5.2. Shrinking priors 

We next consider shrinking priors. 

Proposition 4. Suppose that Assumptions 1-4 are satisfied. Furthermore,
assume that $\sup_{w \in [0,1]} E[U^2 1(|U| > \lambda) \mid W = w] \to 0$ as $\lambda \to \infty$. Consider either
case (a) or case (b) below:

Case (a) $g_0 \in B^s_{\infty,\infty}$, and let the generating prior $\Pi_n$ be the distribution
of $b^{J_n} = (b_1, \ldots, b_{2^{J_n}})^T$ constructed by the following steps: 1) Generate
$u_1, \ldots, u_{2^{J_n}} \sim U[-A_n, A_n]$ i.i.d. with $A_n \sim (\log n)\sqrt{2^{J_n}}$; 2) Let $b_l = u_l$
for $l = 1, \ldots, 2^{J_0}$ and $b_{2^j+k} = 2^{-j(s+1/2)}u_{2^j+k}$ for $k = 1, \ldots, 2^j$; $j =
J_0, \ldots, J_n - 1$.

Case (b) $g_0 \in B^s_{2,2}$, and let the generating prior $\Pi_n$ be the distribution of $b^{J_n} =
(b_1, \ldots, b_{2^{J_n}})^T$ constructed by the following steps: 1) Generate $u_1, \ldots, u_{2^{J_n}} \sim
N(0, A_n^2)$ i.i.d. with $A_n \sim (\log n)\sqrt{2^{J_n}}$; 2) Let $b_l = u_l$ for $l = 1, \ldots, 2^{J_0}$
and $b_{2^j+k} = 2^{-j(s+1/2)}u_{2^j+k}$ for $k = 1, \ldots, 2^j$; $j = J_0, \ldots, J_n - 1$.

Let $2^{J_n} \sim n^{1/(2r+2s+1)}$. Then, in either case, there exists a constant $M > 0$
such that $\Pi_n\{g : \|g - g_0\| > Mn^{-s/(2s+2r+1)} \mid \mathcal{D}_n\} \stackrel{P}{\to} 0$ and $P\{\|\hat g_{QB} - g_0\| >
Mn^{-s/(2r+2s+1)}\} \to 0$.

Proof. See Appendix D. □

Proposition 4 shows that a class of shrinking priors, suitably rescaled by the
factor $A_n \to \infty$, leads to the optimal convergence rate. The rescaling is used to
guarantee sufficient "flatness" of the priors.

From a theoretical point of view, using non-shrinking priors is sufficient to
achieve the optimal convergence rate. However, practically, it would be beneficial
to use shrinking priors since e.g. putting the prior in case (a) roughly means
adding a penalty on the magnitude of the Hölder(-Zygmund) norm, which would
result in numerical stability (likewise, putting the prior in case (b) roughly
means adding a penalty on the magnitude of the Sobolev norm).
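The level-dependent shrinkage in case (b) can be sketched in a few lines. This is an illustration only: the rescaling factor `A_n` is left as a user-supplied argument rather than computed from $n$, the numerical values are arbitrary, and the function names are hypothetical.

```python
import random

def shrinkage_scales(J0, J, s):
    """Scale applied to u_l for l = 1..2^J: 1 on the coarse block l <= 2^{J0},
    and 2^{-j(s+1/2)} on resolution level j = J0, ..., J-1 (2^j coefficients each)."""
    scales = [1.0] * (2 ** J0)
    for j in range(J0, J):
        scales += [2.0 ** (-j * (s + 0.5))] * (2 ** j)
    return scales

def draw_case_b(J0, J, s, A_n, rng):
    """One draw of b^{J} from the case (b) construction with u_l ~ N(0, A_n^2) i.i.d."""
    return [c * rng.gauss(0.0, A_n) for c in shrinkage_scales(J0, J, s)]

rng = random.Random(0)
scales = shrinkage_scales(J0=1, J=3, s=1.0)
b = draw_case_b(J0=1, J=3, s=1.0, A_n=2.0, rng=rng)   # A_n chosen arbitrarily here
```

Replacing `rng.gauss(0.0, A_n)` by `rng.uniform(-A_n, A_n)` gives the corresponding sketch for case (a). Finer levels receive smaller prior scale, which is the sense in which these priors penalize rough functions.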

6. Proof of Theorem 1 

Before proving Theorem 1, we first prepare some technical lemmas (Lemmas 1-4)
and establish preliminary rates of contraction for the quasi-posterior distribution
(Proposition 5). Proofs of Lemmas 1, 2 and 4 are given in Appendix A. For
notational convenience, define the matrices

\[
\hat\Phi_{WX} = E_n[\phi^{J_n}(W_i)\phi^{J_n}(X_i)^T], \quad \hat\Phi_{XW} = \hat\Phi_{WX}^T, \quad \text{and} \quad \hat\Phi_{WW} = E_n[\phi^{J_n}(W_i)^{\otimes 2}].
\]

Recall that $\Phi_{WX} = E[\hat\Phi_{WX}] = E[\phi^{J_n}(W)\phi^{J_n}(X)^T]$ and $\Phi_{WW} = E[\hat\Phi_{WW}] =
E[\phi^{J_n}(W)^{\otimes 2}]$.

Lemma 1. Suppose that Assumptions 1-4 are satisfied. Let $J_n \to \infty$ as $n \to \infty$.
(i) There exists a constant $D > 0$ such that $\sup_{w \in [0,1]} \|\phi^J(w)\|_{\ell^2} \leq D2^{J/2}$
for all $J \geq J_0$. (ii) $C_1^{-1} \leq s_{\min}(E[\phi^J(W)^{\otimes 2}]) \leq s_{\max}(E[\phi^J(W)^{\otimes 2}]) \leq C_1$
and $s_{\max}(E[\phi^J(W)\phi^J(X)^T]) \leq C_1$ for all $J \geq J_0$. (iii) If $J_n2^{J_n}/n \to 0$,
$\|\hat\Phi_{WW} - \Phi_{WW}\|_{op} = O_P(\sqrt{J_n2^{J_n}/n})$ and $\|\hat\Phi_{WX} - \Phi_{WX}\|_{op} = O_P(\sqrt{J_n2^{J_n}/n})$.
(iv) $\|E_n[\phi^{J_n}(W_i)R_i]\|_{\ell^2}^2 = O_P(2^{J_n}/n + \tau_{J_n}^22^{-2J_ns})$. (v) If $J_n2^{J_n(2r+1)}/n \to 0$,
$s_{\min}(\hat\Phi_{WX}) \geq (1 - o_P(1))\tau_{J_n}$.

The following lemma characterizes the total variation convergence between 
two centered multivariate distributions with increasing dimensions via the speed 
of convergence between the corresponding covariance matrices. 

Lemma 2. Let $\Sigma_n$ be a sequence of symmetric positive definite matrices of
dimension $k_n \to \infty$ as $n \to \infty$ such that $\|\Sigma_n - I_{k_n}\|_{op} = o(k_n^{-1})$. Then, as
$n \to \infty$,

\[
\int |dN(0, \Sigma_n)(x) - dN(0, I_{k_n})(x)|\,dx \to 0.
\]

The following lemma is due to Lemma 4 of Bontemps (2011).

Lemma 3. Let $Z$ be a $k$-vector of constants with $k \in \mathbb{N}$. Then, $\|N(Z, I_k) -
N(0, I_k)\|_{TV} \leq \|Z\|_{\ell^2}/\sqrt{2\pi}$.
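The bound in Lemma 3 can be checked numerically in a simple way. The sketch below assumes the total variation distance is taken as the supremum over events, in which case the distance between $N(Z, I_k)$ and $N(0, I_k)$ depends only on $\|Z\|_{\ell^2}$ and equals $2\Phi(\|Z\|_{\ell^2}/2) - 1$, with $\Phi$ the standard normal distribution function. The function names are illustrative.

```python
import math

def tv_shifted_gaussians(shift_norm):
    """sup_A |N(Z, I)(A) - N(0, I)(A)| for ||Z|| = shift_norm, i.e. 2*Phi(||Z||/2) - 1."""
    return math.erf(shift_norm / (2.0 * math.sqrt(2.0)))

def lemma3_bound(shift_norm):
    """Right-hand side of Lemma 3: ||Z|| / sqrt(2*pi)."""
    return shift_norm / math.sqrt(2.0 * math.pi)

grid = [0.01 * i for i in range(0, 501)]          # ||Z|| ranging over [0, 5]
gaps = [lemma3_bound(z) - tv_shifted_gaussians(z) for z in grid]
```

Under this convention every entry of `gaps` is nonnegative, and the bound is tight to first order as $\|Z\|_{\ell^2} \to 0$, which is the regime in which the lemma is applied.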

The following lemma is used in the proof of Theorem 1.

Lemma 4. Let $\hat A_n$ be a sequence of random $k_n \times k_n$ matrices where $k_n$ is
either bounded or $k_n \to \infty$ as $n \to \infty$. Suppose that there exist sequences of
positive constants $\epsilon_n, \delta_n$ and a sequence of non-random, non-singular $k_n \times k_n$
matrices $A_n$ such that $\epsilon_n \to 0$, $\delta_n \to 0$, $s_{\min}(A_n) \geq \epsilon_n$, $\|\hat A_n - A_n\|_{op} = O_P(\delta_n)$
and $\epsilon_n^{-1}\delta_n \to 0$. Then, $\hat A_n$ is non-singular with probability approaching one and
$\|\hat A_n^{-1}A_n - I_{k_n}\|_{op} = O_P(\epsilon_n^{-1}\delta_n)$. Likewise, $\|A_n\hat A_n^{-1} - I_{k_n}\|_{op} = O_P(\epsilon_n^{-1}\delta_n)$.

The following proposition gives preliminary rates of contraction for the
quasi-posterior distribution.

Proposition 5 (Preliminary contraction rates). Suppose that Assumptions 1-4
are satisfied. Take $J_n$ in such a way that $J_n \to \infty$ and $2^{J_n} = o((n/\log n)^{1/(2r+1)})$.
Let $\epsilon_n$ be a sequence of positive constants such that $\epsilon_n \to 0$ and $\sqrt{n}\epsilon_n \to \infty$.
Assume that a sequence of generating priors $\Pi_n$ satisfies condition P1) of Theorem
1. Define the data-dependent, empirical seminorm $\|\cdot\|_{\mathcal{D}_n}$ on $\mathbb{R}^{2^{J_n}}$ by

\[
\|b^{J_n}\|_{\mathcal{D}_n} = \|\hat\Phi_{WX}b^{J_n}\|_{\ell^2}, \quad b^{J_n} \in \mathbb{R}^{2^{J_n}}.
\]

Then, we have for every sequence $M_n \to \infty$,

\[
\Pi_n\{b^{J_n} : \|b^{J_n} - b_0^{J_n}\|_{\mathcal{D}_n} > M_n(\epsilon_n + \tau_{J_n}2^{-J_ns}) \mid \mathcal{D}_n\} \stackrel{P}{\to} 0.
\]

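Proof of Proposition 5. Let $\delta_n = \epsilon_n + \tau_{J_n}2^{-J_ns}$. We wish to show that there
exists a constant $c_0 > 0$ such that

\[
P\left\{ \Pi_n(b^{J_n} : \|b^{J_n} - b_0^{J_n}\|_{\mathcal{D}_n} > M_n\delta_n \mid \mathcal{D}_n) \leq e^{-c_0M_n^2n\delta_n^2} \right\} \to 1. \tag{12}
\]

Note that since $\sqrt{n}\epsilon_n \to \infty$, $n\delta_n^2 \geq n\epsilon_n^2 \to \infty$. Below, $c_1, c_2, \ldots$ are some positive
constants of which the values are understood in the context.

Recall $R_i = U_i + \sum_{l > 2^{J_n}} b_{0l}\phi_l(X_i) = U_i + (g_0(X_i) - P_{J_n}g_0(X_i))$. Then, for
$b^{J_n} \in \mathbb{R}^{2^{J_n}}$,

\[
\begin{aligned}
E_n[\hat m^2(W_i, b^{J_n})] ={}& -2(b^{J_n} - b_0^{J_n})^T\hat\Phi_{XW}\hat\Phi_{WW}^{-1}E_n[\phi^{J_n}(W_i)R_i] \\
& + (b^{J_n} - b_0^{J_n})^T\hat\Phi_{XW}\hat\Phi_{WW}^{-1}\hat\Phi_{WX}(b^{J_n} - b_0^{J_n}) \\
& + E_n[\phi^{J_n}(W_i)R_i]^T\hat\Phi_{WW}^{-1}E_n[\phi^{J_n}(W_i)R_i].
\end{aligned} \tag{13}
\]

Since the last term is independent of $b^{J_n}$, it is canceled out in the quasi-posterior
distribution. Denote by $\ell_{b^{J_n}}(\mathcal{D}_n)$ the sum of the first two terms in (13). Then,

\[
\Pi_n(db^{J_n} \mid \mathcal{D}_n) \propto \exp\{-(n/2)\ell_{b^{J_n}}(\mathcal{D}_n)\}\Pi_n(db^{J_n}).
\]

Using the fact that for any $x, y, c \in \mathbb{R}$ with $c > 0$, $2xy \leq cx^2 + c^{-1}y^2$, we have

\[
\ell_{b^{J_n}}(\mathcal{D}_n) \geq (\lambda_{\min} - c)\|b^{J_n} - b_0^{J_n}\|_{\mathcal{D}_n}^2 - c^{-1}\lambda_{\max}^2\|E_n[\phi^{J_n}(W_i)R_i]\|_{\ell^2}^2, \quad \forall c > 0, \tag{14}
\]

where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum and maximum eigenvalues of the matrix
$\hat\Phi_{WW}^{-1}$, respectively. Likewise, we have

\[
\ell_{b^{J_n}}(\mathcal{D}_n) \leq (\lambda_{\max} + c)\|b^{J_n} - b_0^{J_n}\|_{\mathcal{D}_n}^2 + c^{-1}\lambda_{\max}^2\|E_n[\phi^{J_n}(W_i)R_i]\|_{\ell^2}^2, \quad \forall c > 0. \tag{15}
\]

Define the event

\[
\mathcal{E}_{1n} = \{\mathcal{D}_n : \lambda_{\min} < 0.5C_1^{-1}\} \cup \{\mathcal{D}_n : \lambda_{\max} > 1.5C_1\} \cup \{\mathcal{D}_n : \|E_n[\phi^{J_n}(W_i)R_i]\|_{\ell^2}^2 > M_n\delta_n^2\}.
\]

Construct the "tests" $\omega_n$ by $\omega_n = 1(\mathcal{E}_{1n})$. Then, we have

\[
\begin{aligned}
\Pi_n(b^{J_n} : \|b^{J_n} - b_0^{J_n}\|_{\mathcal{D}_n} > M_n\delta_n \mid \mathcal{D}_n)
&= \Pi_n(b^{J_n} : \|b^{J_n} - b_0^{J_n}\|_{\mathcal{D}_n} > M_n\delta_n \mid \mathcal{D}_n)\{\omega_n + (1 - \omega_n)\} \\
&\leq \omega_n + \Pi_n(b^{J_n} : \|b^{J_n} - b_0^{J_n}\|_{\mathcal{D}_n} > M_n\delta_n \mid \mathcal{D}_n)(1 - \omega_n).
\end{aligned} \tag{16}
\]

By Lemma 1 (ii)-(iv), we have $P(\omega_n = 1) = P(\mathcal{E}_{1n}) \to 0$.

For the second term in (16), taking $c > 0$ sufficiently small in (14), we have

\[
(1 - \omega_n)\int_{\|b^{J_n} - b_0^{J_n}\|_{\mathcal{D}_n} > M_n\delta_n} \exp\{-(n/2)\ell_{b^{J_n}}(\mathcal{D}_n)\}\Pi_n(db^{J_n})
\leq \exp\{-c_1M_n^2n\delta_n^2 + O(M_nn\delta_n^2)\},
\]

which, since $M_n \to \infty$, is at most $\exp\{-c_2M_n^2n\delta_n^2\}$ for all $n$ sufficiently large.
On the other hand, taking, say $c = 1$ in (15), we have

\[
\begin{aligned}
(1 - \omega_n)\int \exp\{-(n/2)\ell_{b^{J_n}}(\mathcal{D}_n)\}\Pi_n(db^{J_n})
&\geq (1 - \omega_n)\int_{\|b^{J_n} - b_0^{J_n}\|_{\mathcal{D}_n} \leq \sqrt{M_n}\epsilon_n} \exp\{-(n/2)\ell_{b^{J_n}}(\mathcal{D}_n)\}\Pi_n(db^{J_n}) \\
&\geq (1 - \omega_n)e^{-c_3M_nn\epsilon_n^2}\int_{\|b^{J_n} - b_0^{J_n}\|_{\mathcal{D}_n} \leq \sqrt{M_n}\epsilon_n} \Pi_n(db^{J_n}).
\end{aligned}
\]

Denote by $\hat s_{\max}$ the maximum singular value of the matrix $\hat\Phi_{WX}$, so that
$\|b^{J_n} - b_0^{J_n}\|_{\mathcal{D}_n} \leq \hat s_{\max}\|b^{J_n} - b_0^{J_n}\|_{\ell^2}$.

Define the event $\mathcal{E}_{2n} = \{\mathcal{D}_n : \hat s_{\max} \leq 1.5C_1\}$. By Lemma 1 (ii) and (iii), we
have $P(\mathcal{E}_{2n}) \to 1$. Since $M_n \to \infty$, for all $n$ sufficiently large, we have

\[
\begin{aligned}
1(\mathcal{E}_{2n})(1 - \omega_n)\int \exp\{-(n/2)\ell_{b^{J_n}}(\mathcal{D}_n)\}\Pi_n(db^{J_n})
&\geq 1(\mathcal{E}_{2n})(1 - \omega_n)e^{-c_3M_nn\epsilon_n^2}\,\Pi_n(b^{J_n} : \|b^{J_n} - b_0^{J_n}\|_{\ell^2} \leq \epsilon_n) \\
&\geq 1(\mathcal{E}_{2n})(1 - \omega_n)e^{-c_3M_nn\epsilon_n^2 - Cn\epsilon_n^2} \\
&\geq 1(\mathcal{E}_{2n})(1 - \omega_n)e^{-c_4M_nn\epsilon_n^2},
\end{aligned}
\]

where the second inequality is due to the small ball condition P1). Summarizing,
we have

\[
\Pi_n(b^{J_n} : \|b^{J_n} - b_0^{J_n}\|_{\mathcal{D}_n} > M_n\delta_n \mid \mathcal{D}_n)(1 - \omega_n) \leq 1(\mathcal{E}_{2n}^c) + e^{-c_2M_n^2n\delta_n^2 + c_4M_nn\epsilon_n^2}.
\]

Since $\epsilon_n \leq \delta_n$, we obtain (12) for a sufficiently small $c_0 > 0$. □

We are now in position to prove Theorem 1. We will say that a sequence
of random variables $A_n$ is eventually bounded by another sequence of random
variables $B_n$ if $P(A_n \leq B_n) \to 1$ as $n \to \infty$.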

Proof of Theorem 1. We first note that by Lemma 1 (ii), (iii) and (v), the
matrices $\hat\Phi_{WX}$ and $\hat\Phi_{WW}$ are non-singular with probability approaching one.
Conditional on $\mathcal{D}_n$, define the rescaled "parameter" $\theta^{J_n} = (\theta_1, \ldots, \theta_{2^{J_n}})^T =
\sqrt{n}\hat\Phi_{WX}(b^{J_n} - b_0^{J_n})$. By (13), the corresponding "quasi-posterior" density for
$\theta^{J_n}$ is given by

\[
\pi_n^*(\theta^{J_n} \mid \mathcal{D}_n)\,d\theta^{J_n} \propto \pi_n(b_0^{J_n} + \hat\Phi_{WX}^{-1}\theta^{J_n}/\sqrt{n})\,dN(\tilde\Delta_n, \hat\Phi_{WW})(\theta^{J_n})\,d\theta^{J_n},
\]

where $\tilde\Delta_n = \sqrt{n}E_n[\phi^{J_n}(W_i)R_i]$ (this operation is valid as soon as $\hat\Phi_{WX}$ and
$\hat\Phi_{WW}$ are non-singular, of which the probability is approaching one).

The proof of Theorem 1 consists of 3 steps. After Step 1, we will turn to the
proof of (8). The remaining two steps are devoted to the proof of (9).

Step 1. We first show that

\[
\int |\pi_n^*(\theta^{J_n} \mid \mathcal{D}_n) - dN(\tilde\Delta_n, \hat\Phi_{WW})(\theta^{J_n})|\,d\theta^{J_n} \stackrel{P}{\to} 0. \tag{17}
\]

In this step, we do not assume $2^{J_n} = o((n/\log n)^{1/(2r+3)})$. As before, let $\delta_n =
\epsilon_n + \tau_{J_n}2^{-J_ns}$. By Proposition 5, for every sequence $M_n \to \infty$,

\[
\int_{\|\theta^{J_n}\|_{\ell^2} \leq M_n\sqrt{n}\delta_n} \pi_n^*(\theta^{J_n} \mid \mathcal{D}_n)\,d\theta^{J_n} = 1 + o_P(1),
\]

by which we have

\[
\begin{aligned}
\text{Left side of (17)} \leq{}& \int_{\|\theta^{J_n}\|_{\ell^2} \leq M_n\sqrt{n}\delta_n} |\pi_n^*(\theta^{J_n} \mid \mathcal{D}_n) - dN(\tilde\Delta_n, \hat\Phi_{WW})(\theta^{J_n})|\,d\theta^{J_n} \\
& + \int_{\|\theta^{J_n}\|_{\ell^2} > M_n\sqrt{n}\delta_n} dN(\tilde\Delta_n, \hat\Phi_{WW})(\theta^{J_n})\,d\theta^{J_n} + o_P(1).
\end{aligned} \tag{18}
\]

By Lemma 1 (iv), $\|\tilde\Delta_n\|_{\ell^2} = O_P(\sqrt{n}\delta_n)$, and by Lemma 1 (ii) and (iii), $(1 -
o_P(1))C_1^{-1} \leq s_{\min}(\hat\Phi_{WW}) \leq s_{\max}(\hat\Phi_{WW}) \leq (1 + o_P(1))C_1$, so that the second
integral is eventually bounded by

\[
\int_{\|\theta^{J_n}\|_{\ell^2} > \sqrt{M_n}\sqrt{n}\delta_n} dN(0, I_{2^{J_n}})(\theta^{J_n})\,d\theta^{J_n}. \tag{19}
\]

Here, note that $M_n$ is replaced by $\sqrt{M_n}$ to "absorb" the constant. By Borell's
inequality for Gaussian measures (see, for example, van der Vaart and Wellner,
1996, Lemma A.2.2), for all $x > 0$,

\[
P\left(\|N(0, I_{2^{J_n}})\|_{\ell^2} > \sqrt{2^{J_n}} + x\right) \leq 2e^{-x^2/2}. \tag{20}
\]

Here since $n\delta_n^2 \geq n\epsilon_n^2 \gtrsim 2^{J_n}$, $\sqrt{M_nn}\delta_n/\sqrt{2^{J_n}} \to \infty$, so that the integral in (19)
is $o(1)$.

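It remains to show that the first integral in (18) is $o_P(1)$. This step uses a
standard cancellation argument. Let $C_n := \{\theta^{J_n} \in \mathbb{R}^{2^{J_n}} : \|\theta^{J_n}\|_{\ell^2} \leq M_n\sqrt{n}\delta_n\}$.
First, provided that $\|\hat\Phi_{WX}^{-1}\|_{op} \leq 1.5\tau_{J_n}^{-1}$, for all $\theta^{J_n} \in C_n$, $\|\hat\Phi_{WX}^{-1}\theta^{J_n}/\sqrt{n}\|_{\ell^2} \leq
1.5M_n\tau_{J_n}^{-1}\delta_n \leq 1.5M_n(2^{-J_ns} + C_12^{J_nr}\epsilon_n) \lesssim M_n\gamma_n$. So taking $M_n \to \infty$
sufficiently slowly such that $M_n = o(L_n)$, $\|\hat\Phi_{WX}^{-1}\theta^{J_n}/\sqrt{n}\|_{\ell^2} \leq L_n\gamma_n$ and hence
$\pi_n(b_0^{J_n} + \hat\Phi_{WX}^{-1}\theta^{J_n}/\sqrt{n}) > 0$ for all $n$ sufficiently large. Here, by Lemma 1 (v),
we have

\[
P\left(\|\hat\Phi_{WX}^{-1}\|_{op} \leq 1.5\tau_{J_n}^{-1}\right) \to 1.
\]

Suppose that $\|\hat\Phi_{WX}^{-1}\|_{op} \leq 1.5\tau_{J_n}^{-1}$. Let $\pi_{n,C_n}^*(\theta^{J_n} \mid \mathcal{D}_n)$ and $dN^{C_n}(\tilde\Delta_n, \hat\Phi_{WW})(\theta^{J_n})$
denote the probability densities obtained by first restricting $\pi_n^*(\theta^{J_n} \mid \mathcal{D}_n)$ and
$dN(\tilde\Delta_n, \hat\Phi_{WW})(\theta^{J_n})$ to the ball $C_n$ and then renormalizing, respectively. By the
first part of the present proof, replacing $\pi_n^*(\theta^{J_n} \mid \mathcal{D}_n)$ and $dN(\tilde\Delta_n, \hat\Phi_{WW})(\theta^{J_n})$
by $\pi_{n,C_n}^*(\theta^{J_n} \mid \mathcal{D}_n)$ and $dN^{C_n}(\tilde\Delta_n, \hat\Phi_{WW})(\theta^{J_n})$ respectively in the first
integral in (18) has impact at most $o_P(1)$. Then, abbreviating $\pi_{n,C_n}^*(\theta^{J_n} \mid \mathcal{D}_n)$
by $\pi_{n,C_n}^*$, $dN^{C_n}(\tilde\Delta_n, \hat\Phi_{WW})(\theta^{J_n})$ by $dN^{C_n}$, $dN(\tilde\Delta_n, \hat\Phi_{WW})(\theta^{J_n})$ by $dN$, and
$\pi_n(b_0^{J_n} + \hat\Phi_{WX}^{-1}\theta^{J_n}/\sqrt{n})$ by $\pi_n$, we have

\[
\begin{aligned}
\int |\pi_{n,C_n}^* - dN^{C_n}|
&= \int \left| 1 - \frac{dN^{C_n}}{\pi_{n,C_n}^*} \right| \pi_{n,C_n}^*
 = \int \left| 1 - \frac{dN/\int_{C_n} dN}{\pi_n\,dN/\int_{C_n} \pi_n\,dN} \right| \pi_{n,C_n}^* \\
&= \int \left| 1 - \frac{\int_{C_n} \pi_n\,dN}{\pi_n\int_{C_n} dN} \right| \pi_{n,C_n}^*
 = \int \left| 1 - \frac{\int_{C_n} \pi_n\,dN^{C_n}}{\pi_n} \right| \pi_{n,C_n}^*.
\end{aligned}
\]

By the convexity of the map $x \mapsto |1 - x|$ and Jensen's inequality, the last
expression is bounded by

\[
\sup_{\theta^{J_n} \in C_n,\ \tilde\theta^{J_n} \in C_n}
\left| 1 - \frac{\pi_n(b_0^{J_n} + \hat\Phi_{WX}^{-1}\tilde\theta^{J_n}/\sqrt{n})}{\pi_n(b_0^{J_n} + \hat\Phi_{WX}^{-1}\theta^{J_n}/\sqrt{n})} \right|,
\]

which is eventually bounded by

\[
\sup_{\|b^{J_n}\|_{\ell^2} \leq L_n\gamma_n,\ \|\tilde b^{J_n}\|_{\ell^2} \leq L_n\gamma_n}
\left| 1 - \frac{\pi_n(b_0^{J_n} + b^{J_n})}{\pi_n(b_0^{J_n} + \tilde b^{J_n})} \right|.
\]

The last expression goes to zero as $n \to \infty$ by the prior flatness condition P2).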
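We now turn to the proof of (8). Take any $M_n \to \infty$ (this $M_n$ may be different
from the previous $M_n$). By Step 1, we have

\[
\sup_{z > 0} \left| \Pi_n\{b^{J_n} : \|\hat\Phi_{WX}(b^{J_n} - b_0^{J_n})\|_{\ell^2} > z \mid \mathcal{D}_n\}
- \int_{\|\theta^{J_n}\|_{\ell^2} > z} dN(n^{-1/2}\tilde\Delta_n, n^{-1}\hat\Phi_{WW})(\theta^{J_n})\,d\theta^{J_n} \right| \stackrel{P}{\to} 0.
\]

Here, by Lemma 1 (v), we have

\[
\|\hat\Phi_{WX}(b^{J_n} - b_0^{J_n})\|_{\ell^2} \geq s_{\min}(\hat\Phi_{WX})\|b^{J_n} - b_0^{J_n}\|_{\ell^2}
\geq (1 - o_P(1))\tau_{J_n}\|b^{J_n} - b_0^{J_n}\|_{\ell^2},
\]

by which we have, uniformly in $z > 0$,

\[
\begin{aligned}
\Pi_n\{b^{J_n} : \|b^{J_n} - b_0^{J_n}\|_{\ell^2} > 2\tau_{J_n}^{-1}z \mid \mathcal{D}_n\}
&\leq \Pi_n\{b^{J_n} : \|\hat\Phi_{WX}(b^{J_n} - b_0^{J_n})\|_{\ell^2} > z \mid \mathcal{D}_n\} + o_P(1) \\
&\leq \int_{\|\theta^{J_n}\|_{\ell^2} > z} dN(n^{-1/2}\tilde\Delta_n, n^{-1}\hat\Phi_{WW})(\theta^{J_n})\,d\theta^{J_n} + o_P(1).
\end{aligned}
\]

By Markov's inequality, the integral in the last expression is bounded by

\[
\frac{1}{nz^2}\left\{\|\tilde\Delta_n\|_{\ell^2}^2 + \mathrm{tr}(\hat\Phi_{WW})\right\}.
\]

By Lemma 1 (ii)-(iv), we have $\|\tilde\Delta_n\|_{\ell^2}^2 + \mathrm{tr}(\hat\Phi_{WW}) = O_P(2^{J_n} + n\tau_{J_n}^22^{-2J_ns})$.
Therefore, we conclude that, taking $z = M_n(\tau_{J_n}2^{-J_ns} + \sqrt{2^{J_n}/n})$, $\Pi_n\{b^{J_n} :
\|b^{J_n} - b_0^{J_n}\|_{\ell^2} > 2M_n(2^{-J_ns} + \tau_{J_n}^{-1}\sqrt{2^{J_n}/n}) \mid \mathcal{D}_n\} \stackrel{P}{\to} 0$, which leads to the
contraction rate result (8).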

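In what follows, we assume $2^{J_n} = o((n/\log n)^{1/(2r+3)})$, and prove the
asymptotic normality result (9).

Step 2. (Replacement of $\hat\Phi_{WW}$ by $\Phi_{WW}$). This step shows that

\[
\int |dN(\tilde\Delta_n, \hat\Phi_{WW})(\theta^{J_n}) - dN(\tilde\Delta_n, \Phi_{WW})(\theta^{J_n})|\,d\theta^{J_n} \stackrel{P}{\to} 0,
\]

which is equivalent to

\[
\int |dN(0, \hat\Phi_{WW})(\theta^{J_n}) - dN(0, \Phi_{WW})(\theta^{J_n})|\,d\theta^{J_n} \stackrel{P}{\to} 0.
\]

By Lemma 1 (ii), (iii) and Lemma 2, this follows if $\sqrt{J_n2^{J_n}/n} = o(2^{-J_n})$, i.e.
$J_n2^{3J_n} = o(n)$, which is satisfied since $2^{J_n} = o((n/\log n)^{1/(2r+3)})$.

Step 3. (Replacement of $\hat\Phi_{WX}$ by $\Phi_{WX}$). We have shown that

\[
\int |\pi_n^*(\theta^{J_n} \mid \mathcal{D}_n) - dN(\tilde\Delta_n, \Phi_{WW})(\theta^{J_n})|\,d\theta^{J_n} \stackrel{P}{\to} 0.
\]

By Scheffé's lemma, this means that

\[
\left\| \Pi_n\{b^{J_n} : \sqrt{n}\hat\Phi_{WX}(b^{J_n} - b_0^{J_n}) \in \cdot \mid \mathcal{D}_n\} - N(\tilde\Delta_n, \Phi_{WW})(\cdot) \right\|_{TV} \stackrel{P}{\to} 0,
\]

or,

\[
\left\| \Pi_n\{b^{J_n} : \sqrt{n}(b^{J_n} - b_0^{J_n}) \in \cdot \mid \mathcal{D}_n\} - N(\hat\Phi_{WX}^{-1}\tilde\Delta_n, \hat\Phi_{WX}^{-1}\Phi_{WW}\hat\Phi_{XW}^{-1})(\cdot) \right\|_{TV} \stackrel{P}{\to} 0.
\]

The last expression is asymptotically valid since $\hat\Phi_{WX}$ is non-singular with
probability approaching one. The remaining step is to replace $\hat\Phi_{WX}$ by $\Phi_{WX}$. This
step requires special care since the minimum singular value of $\Phi_{WX}$ (while
positive) is approaching zero as $n \to \infty$. To conclude the theorem, it suffices to
show the following two assertions:

\[
\left\| N(\hat\Phi_{WX}^{-1}\tilde\Delta_n, \hat\Phi_{WX}^{-1}\Phi_{WW}\hat\Phi_{XW}^{-1}) - N(\hat\Phi_{WX}^{-1}\tilde\Delta_n, \Phi_{WX}^{-1}\Phi_{WW}\Phi_{XW}^{-1}) \right\|_{TV} \stackrel{P}{\to} 0, \tag{21}
\]

\[
\left\| N(\hat\Phi_{WX}^{-1}\tilde\Delta_n, \Phi_{WX}^{-1}\Phi_{WW}\Phi_{XW}^{-1}) - N(\Phi_{WX}^{-1}\tilde\Delta_n, \Phi_{WX}^{-1}\Phi_{WW}\Phi_{XW}^{-1}) \right\|_{TV} \stackrel{P}{\to} 0. \tag{22}
\]

Note that $\Phi_{WX}^{-1}\tilde\Delta_n = \Delta_n$.

Proof of (21): Assertion (21) reduces to

\[
\left\| N(0, \Phi_{WX}\hat\Phi_{WX}^{-1}\Phi_{WW}\hat\Phi_{XW}^{-1}\Phi_{XW}) - N(0, \Phi_{WW}) \right\|_{TV} \stackrel{P}{\to} 0.
\]

By Lemma 1 (ii), (iii) and Lemma 4, $\|\Phi_{WX}\hat\Phi_{WX}^{-1}\Phi_{WW}\hat\Phi_{XW}^{-1}\Phi_{XW} - \Phi_{WW}\|_{op} =
O_P(2^{J_nr}\sqrt{J_n2^{J_n}/n}) = o_P(2^{-J_n})$ (the last equality follows by $2^{J_n} = o((n/\log n)^{1/(2r+3)})$).
Since $C_1^{-1} \leq s_{\min}(\Phi_{WW}) \leq s_{\max}(\Phi_{WW}) \leq C_1$, the desired conclusion follows
from Lemma 2.

Proof of (22): Assertion (22) reduces to

\[
\left\| N((\Phi_{WX}\hat\Phi_{WX}^{-1} - I_{2^{J_n}})\tilde\Delta_n, \Phi_{WW}) - N(0, \Phi_{WW}) \right\|_{TV} \stackrel{P}{\to} 0.
\]

By Lemma 3, and the fact that $s_{\min}(\Phi_{WW}) \geq C_1^{-1}$, the left side is $\lesssim \|(\Phi_{WX}\hat\Phi_{WX}^{-1} -
I_{2^{J_n}})\tilde\Delta_n\|_{\ell^2}$. Here, we have

\[
\begin{aligned}
\|(\Phi_{WX}\hat\Phi_{WX}^{-1} - I_{2^{J_n}})\tilde\Delta_n\|_{\ell^2}
&\leq \|\Phi_{WX}\hat\Phi_{WX}^{-1} - I_{2^{J_n}}\|_{op}\|\tilde\Delta_n\|_{\ell^2} \\
&= O_P\left(\tau_{J_n}^{-1}\sqrt{J_n2^{J_n}/n}\right) \times O_P\left(\sqrt{n}\tau_{J_n}2^{-J_ns} + \sqrt{2^{J_n}}\right) \\
&= o_P(1),
\end{aligned}
\]

where the second line is due to Lemma 1 (iii), (iv) and Lemma 4. The last line
follows from $s > 1/2$ and $2^{J_n} = o((n/\log n)^{1/(2r+3)})$.

Steps 1-3 lead to the asymptotic normality result (9). □

7. Discussion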

In this paper, we have studied the asymptotic properties of quasi-posterior dis- 
tributions against sieve priors in the NPIV model and given some specific priors 
for which the quasi-posterior distribution (the quasi-Bayes estimator) attains 
the minimax optimal rate of contraction (convergence, respectively). These re- 
sults greatly sharpen the previous work of Liao and Jiang (2011). 

We end the paper with some remarks on the direction of future work. First, 
as also noted by Liao and Jiang (2011), (adaptive) selection of the resolution 
level J n in a (quasi-)Bayesian or "empirical" Bayesian approach is an important 
topic to be investigated. Second, a (quasi-)Bayesian analysis is typically useful 
in the analysis of complex models in which frequentistic estimation is difficult 
to implement due to non-differentiability /non-convex nature of loss functions. 
This usefulness comes from the fact that a (quasi-)Bayesian approach is typically 
able to avoid numerical optimization. See Chernozhukov and Hong (2003) and 
Liu et al. (2007) for the finite dimensional case. In infinite dimensional models, 
such a computational challenge in frequentistic estimation occurs in the analysis 
of nonparamctric instrumental quantile regression models (Horowitz and Lee, 
2007; Chen and Pouzo, 2011; Gagliardini and Scaillet, 2011). In that model, 
a typical loss function contains the indicator function and hence highly non- 
convex. In such a case, the computation of an optimal solution is by itself dif- 
ficult, and a solution obtained, if possible, is typically not guaranteed to be 
globally optimal since there may be many local optima. It is hence of inter- 
est to extend the results of the paper to nonparametric instrumental quantile 
regression models, which is currently under investigation. 
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Appendix A: Proofs of Lemmas 1, 3 and 4 



Proof of Lemma 1. Part (ii) follows from Assumption 1 and the fact that {4>i} 
is an orthonormal basis of £2 [0,1]. Part (hi) follows from Rudclson's (1999) 
inequality and (i). For the reader's convenience, we state Rudclson's inequality 
in Appendix E. For Part (v), we hrst note that, by (hi) and Weyl's pcrtubation 
theorem (Bhatia, 1997, Problem III.6.13), s min ($ wx ) > tj„ - P ( V / J„2 J "/n). 
Since now J n 2 J " jn = o(2~ J " r ) = o(rj n ), we have s m i n (&wx) > (1— °p(1))T7„ ■ 
For the proof of (i), denote by N the order of the Daubechies pair (tp, ip) generat- 
ing the CDV wavelet basis {<f>i, I > 1}. Then, for each x £ [0, 1] and each j > Jo, 
the number of nonzero elements in 4>2i+i{ x )i • ■ • j 02J+ 1 ( x ) is bounded by some 
constant depending only on N, and each 4>2i+k{ x ) is bounded by some constant 
(depending only on tp) times 2 J / 2 for all k = 1, . . . , IK Similarly, <f>\, . . . , (f> 2 j 
are uniformly bounded. Therefore, there exists a constant D depending only on 
((p,ip) such that \\(f> J (x)\\% < D{2 J ° + J2jZj 2 j ) = D2 J for all x e [0, 1]. 



Finally, we wish to show Part (iv). First, observe that

$$\|E_n[\phi^{J_n}(W_i)R_i]\|_{\ell^2}^2 \leq 2\|E_n[\phi^{J_n}(W_i)R_i] - E[\phi^{J_n}(W)R]\|_{\ell^2}^2 + 2\|E[\phi^{J_n}(W)R]\|_{\ell^2}^2.$$

By a simple moment calculation, the first term is $O_P(2^{J_n}/n)$. For the second term, by Assumptions
3 and 4 (ii),

$$\|E[\phi^{J_n}(W)R]\|_{\ell^2}^2 = \|E[\phi^{J_n}(W)(g_0 - P_{J_n}g_0)(X)]\|_{\ell^2}^2 \lesssim \tau_{J_n}^2 \|g_0 - P_{J_n}g_0\|^2 \lesssim \tau_{J_n}^2 2^{-2J_n s}.$$

This completes the proof. □



Proof of Lemma 2. Step 1. We first show that $|\Sigma_n| = 1 + o(1)$ ($|\Sigma_n|$ denotes
the determinant of $\Sigma_n$). Let $\lambda_{\min,n}$ and $\lambda_{\max,n}$ denote the minimum and maximum
eigenvalues of $\Sigma_n$, respectively. Then, by Weyl's perturbation theorem,
$1 - o(k_n^{-1}) \leq \lambda_{\min,n} \leq \lambda_{\max,n} \leq 1 + o(k_n^{-1})$, so that

$$(1 - o(k_n^{-1}))^{k_n} \leq \lambda_{\min,n}^{k_n} \leq |\Sigma_n| \leq \lambda_{\max,n}^{k_n} = (1 + o(k_n^{-1}))^{k_n}.$$

Here both sides converge to 1.
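Step 2. By Step 1, we have

$$\int |dN(0,\Sigma_n)(x) - dN(0,I_{k_n})(x)|\,dx = \frac{1}{(2\pi)^{k_n/2}} \int \left| \frac{1}{|\Sigma_n|^{1/2}} e^{-x^T\Sigma_n^{-1}x/2} - e^{-x^Tx/2} \right| dx$$

$$\leq o(1) + \frac{1}{(2\pi)^{k_n/2}|\Sigma_n|^{1/2}} \int \left| e^{-x^T\Sigma_n^{-1}x/2} - e^{-x^Tx/2} \right| dx$$

$$= o(1) + \frac{1}{(2\pi)^{k_n/2}(1+o(1))} \int e^{-x^Tx/2} \left| e^{-x^T(\Sigma_n^{-1} - I_{k_n})x/2} - 1 \right| dx.$$

By assumption, we have $\epsilon_n := \|\Sigma_n^{-1} - I_{k_n}\|_{op} \leq \|\Sigma_n^{-1}\|_{op}\|I_{k_n} - \Sigma_n\|_{op} = o(k_n^{-1})$.
Now, $|e^{-x^T(\Sigma_n^{-1} - I_{k_n})x/2} - 1| \leq e^{\epsilon_n x^Tx/2} - e^{-\epsilon_n x^Tx/2}$. By a direct calculation, the
conclusion follows from the fact that $(1 \pm \epsilon_n)^{k_n} = 1 + o(1)$. □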
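The "direct calculation" in Step 2 can be made concrete. Integrating the Gaussian kernels gives $(2\pi)^{-k/2}\int e^{-x^Tx/2}(e^{\epsilon x^Tx/2} - e^{-\epsilon x^Tx/2})\,dx = (1-\epsilon)^{-k/2} - (1+\epsilon)^{-k/2}$ for $0 < \epsilon < 1$, which vanishes when $\epsilon = o(k^{-1})$. The following sketch evaluates this closed form; the function name `gaussian_l1_bound` and the choice $\epsilon = k^{-2}$ are illustrative assumptions, not part of the paper.

```python
def gaussian_l1_bound(k, eps):
    """Closed form of (2*pi)^(-k/2) * integral of
    exp(-|x|^2/2) * (exp(eps*|x|^2/2) - exp(-eps*|x|^2/2)) over R^k,
    valid for 0 < eps < 1.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    return (1.0 - eps) ** (-k / 2.0) - (1.0 + eps) ** (-k / 2.0)


if __name__ == "__main__":
    # Hypothetical rate eps = k^(-2), which is o(1/k): the bound shrinks as k grows.
    for k in (10, 100, 1000, 10000):
        print(k, gaussian_l1_bound(k, k ** -2.0))
```

With $\epsilon$ of exact order $k^{-1}$ the expression stays bounded away from zero, which is why the assumption is stated as $o(k_n^{-1})$ rather than $O(k_n^{-1})$.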
Proof of Lemma 4. The first assertion follows from the assumption. Suppose
now that $\hat A_n$ is non-singular. Then, $\hat A_n^{-1}A_n = (\hat A_n - A_n + A_n)^{-1}A_n = (A_n^{-1}\hat A_n - I_{k_n} + I_{k_n})^{-1}$.
Here, $A_n^{-1}\hat A_n - I_{k_n} = A_n^{-1}(\hat A_n - A_n)$, so that
$\|A_n^{-1}\hat A_n - I_{k_n}\|_{op} \leq \|A_n^{-1}\|_{op}\|\hat A_n - A_n\|_{op} = s_{\min}^{-1}(A_n)\|\hat A_n - A_n\|_{op} = O_P(\epsilon_n^{-1}\delta_n)$.
Let $\Delta = I_{k_n} - A_n^{-1}\hat A_n$. Then, $\hat A_n^{-1}A_n = (I_{k_n} - \Delta)^{-1} = I_{k_n} + \sum_{m=1}^{\infty}\Delta^m$ (Neumann series).
Therefore, we conclude that $\|\hat A_n^{-1}A_n - I_{k_n}\|_{op} = \|\sum_{m=1}^{\infty}\Delta^m\|_{op} \leq \sum_{m=1}^{\infty}\|\Delta\|_{op}^m = \|\Delta\|_{op}\sum_{m=0}^{\infty}\|\Delta\|_{op}^m = O_P(\epsilon_n^{-1}\delta_n)$. □
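The Neumann series step in the proof of Lemma 4 can be checked numerically. The sketch below builds a matrix `A` and a small perturbation `A_hat` (both hypothetical, chosen only for illustration), forms $\Delta = I - A^{-1}\hat A$, and compares $\|\hat A^{-1}A - I\|_{op}$ with the geometric-series bound $\|\Delta\|_{op}/(1 - \|\Delta\|_{op})$, which is valid whenever $\|\Delta\|_{op} < 1$.

```python
import numpy as np


def neumann_check(k=5, scale=0.05, seed=0):
    """Return (lhs, d): lhs = ||A_hat^{-1} A - I||_op and d = ||Delta||_op.

    A and A_hat are illustrative stand-ins, not objects from the paper.
    """
    rng = np.random.default_rng(seed)
    identity = np.eye(k)
    A = 2.0 * identity                                # s_min(A) = 2
    A_hat = A + scale * rng.standard_normal((k, k))   # small perturbation of A
    delta = identity - np.linalg.solve(A, A_hat)      # Delta = I - A^{-1} A_hat
    d = np.linalg.norm(delta, 2)                      # operator (spectral) norm
    lhs = np.linalg.norm(np.linalg.solve(A_hat, A) - identity, 2)
    return lhs, d


if __name__ == "__main__":
    lhs, d = neumann_check()
    print("||A_hat^-1 A - I||_op        =", lhs)
    print("||Delta|| / (1 - ||Delta||) =", d / (1.0 - d))
```

The bound is of the same order as $\|\Delta\|_{op}$ itself as long as $\|\Delta\|_{op}$ stays bounded away from 1, which matches the final rate in the proof.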

Appendix B: Proof of Theorem 2 

We first prove the following lemma. 

Lemma 5. Suppose that the conditions of Theorem 2 are satisfied. Then, there
exists a constant $D > 0$ such that

$$P\{\|E_n[\phi^{J_n}(W_i)U_i]\|_{\ell^2} > D\sqrt{2^{J_n}/n}\} \to 0.$$



Remark 8. It is standard to show that $\|E_n[\phi^{J_n}(W_i)U_i]\|_{\ell^2} = O_P(\sqrt{2^{J_n}/n})$,
which, however, does not lead to the conclusion of Lemma 5 since the former
only implies that for every sequence $M_n \to \infty$, $P\{\|E_n[\phi^{J_n}(W_i)U_i]\|_{\ell^2} > M_n\sqrt{2^{J_n}/n}\} \to 0$.
Hence an additional step is needed. The current proof uses
a truncation argument and Talagrand's concentration inequality.

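Proof of Lemma 5. For a given $\lambda > 0$, define $U_i^- = U_i 1(|U_i| \leq \lambda)$ and $U_i^+ = U_i 1(|U_i| > \lambda)$.
Since $0 = E[U \mid W] = E[U^- \mid W] + E[U^+ \mid W]$, we have

$$E_n[\phi^{J_n}(W_i)U_i] = n^{-1}\sum_{i=1}^n \{\phi^{J_n}(W_i)U_i^- - E[\phi^{J_n}(W)U^-]\} + n^{-1}\sum_{i=1}^n \{\phi^{J_n}(W_i)U_i^+ - E[\phi^{J_n}(W)U^+]\},$$

by which we have

$$\|E_n[\phi^{J_n}(W_i)U_i]\|_{\ell^2} \leq \Big\|n^{-1}\sum_{i=1}^n \{\phi^{J_n}(W_i)U_i^- - E[\phi^{J_n}(W)U^-]\}\Big\|_{\ell^2} + \Big\|n^{-1}\sum_{i=1}^n \{\phi^{J_n}(W_i)U_i^+ - E[\phi^{J_n}(W)U^+]\}\Big\|_{\ell^2} =: I + II.$$

First, by Markov's inequality, we have for all $z > 0$,

$$P(II > z) \leq \frac{E[II^2]}{z^2} \leq \frac{1}{nz^2}\sum_{l=1}^{2^{J_n}} E[\phi_l(W)^2 (U^+)^2]$$

$$\leq \frac{1}{nz^2} \sup_{w \in [0,1]} E[U^2 1(|U| > \lambda) \mid W = w] \times \sum_{l=1}^{2^{J_n}} E[\phi_l(W)^2]$$

$$\leq \frac{C_1 2^{J_n}}{nz^2} \times \sup_{w \in [0,1]} E[U^2 1(|U| > \lambda) \mid W = w],$$

where we have used that $\sum_{l=1}^{2^{J_n}} E[\phi_l(W)^2] = \mathrm{tr}(\Phi_{WW}) \leq 2^{J_n} s_{\max}(\Phi_{WW}) \leq C_1 2^{J_n}$
by Lemma 1 (ii). Thus, we have

$$P\{II > \sqrt{C_1 2^{J_n}/n}\} \leq \sup_{w \in [0,1]} E[U^2 1(|U| > \lambda) \mid W = w].$$

By assumption, the right side goes to zero as $\lambda \to \infty$.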
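To see how the truncated second moment controlling the term $II$ behaves as the truncation level grows, consider the purely illustrative case where $U$ is standard normal and independent of $W$ (an assumption made here only for the example; the paper requires just that the supremum over $w$ vanishes). Then $E[U^2 1(|U| > \lambda)] = 2\{\lambda\varphi(\lambda) + 1 - \Phi(\lambda)\}$, with $\varphi$ and $\Phi$ the standard normal density and distribution function. The function name below is hypothetical.

```python
import math


def truncated_second_moment(lam):
    """E[U^2 1(|U| > lam)] for U ~ N(0, 1), via integration by parts.

    Assumption: Gaussian errors, used only to illustrate the tail term.
    """
    if lam < 0.0:
        raise ValueError("lam must be non-negative")
    pdf = math.exp(-0.5 * lam * lam) / math.sqrt(2.0 * math.pi)
    upper_tail = 0.5 * math.erfc(lam / math.sqrt(2.0))  # 1 - Phi(lam)
    return 2.0 * (lam * pdf + upper_tail)


if __name__ == "__main__":
    for lam in (0.0, 1.0, 2.0, 4.0, 6.0):
        print(lam, truncated_second_moment(lam))
```

At $\lambda = 0$ the expression equals the full variance, and it decays to zero as $\lambda \to \infty$, which is the property the Markov bound above relies on.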

Second, let $Z_i = \phi^{J_n}(W_i)U_i^- - E[\phi^{J_n}(W)U^-]$ (denote by $Z$ the generic version
of $Z_i$). Let $S^{2^{J_n}-1} := \{\alpha^{J_n} \in \mathbb{R}^{2^{J_n}} : \|\alpha^{J_n}\|_{\ell^2} = 1\}$. Then,

$$I = \sup_{\alpha^{J_n} \in S^{2^{J_n}-1}} E_n[(\alpha^{J_n})^T Z_i].$$

We make use of Talagrand's concentration inequality to bound the tail probability
of $I$. For any $\alpha^{J_n} \in S^{2^{J_n}-1}$, by Lemma 1, we have

$$E[\{(\alpha^{J_n})^T Z\}^2] \leq \sup_{w \in [0,1]} E[U^2 \mid W = w] \times s_{\max}(\Phi_{WW}) \leq C_2,$$

$$|(\alpha^{J_n})^T Z| \leq 2\lambda \sup_{w \in [0,1]} \|\phi^{J_n}(w)\|_{\ell^2} \leq D_1 \lambda \sqrt{2^{J_n}}, \quad \text{and}$$

$$(E[I])^2 \leq E[I^2] \leq n^{-1} \sup_{w \in [0,1]} E[U^2 \mid W = w] \times \sum_{l=1}^{2^{J_n}} E[\phi_l(W)^2] \leq C_2 2^{J_n}/n,$$

where $D_1 > 0$ is a constant. Thus, by Talagrand's inequality (see Theorem 5 in
Appendix E), we have for all $z > 0$,

$$P\{I > D_2(\sqrt{2^{J_n}/n} + \sqrt{z/n} + z\lambda\sqrt{2^{J_n}}/n)\} \leq e^{-z},$$

where $D_2 > 0$ is a constant independent of $\lambda$ and $z$.

The final conclusion follows from taking $\lambda = \lambda_n \to \infty$ and $z = z_n \to \infty$
sufficiently slowly. □

Proof of Theorem 2. Let $D_1, D_2$ be some positive constants whose values
are understood from the context. For either $g_0 \in B^s_{\infty,\infty}$ or $B^s_{2,2}$, $\|g_0 - P_{J_n}g_0\| = O(2^{-J_n s}) = o(1)$,
by which we have

$$\sum_{l=1}^{2^{J_n}} \mathrm{Var}\{E_n[\phi_l(W_i)(g_0 - P_{J_n}g_0)(X_i)]\} \leq n^{-1}\sum_{l=1}^{2^{J_n}} E[\phi_l(W)^2\{(g_0 - P_{J_n}g_0)(X)\}^2]$$

$$= n^{-1}\sum_{l=1}^{2^{J_n}} \iint \phi_l(w)^2 \{(g_0 - P_{J_n}g_0)(x)\}^2 f_{X,W}(x,w)\,dx\,dw$$

$$\leq n^{-1}C_1\|g_0 - P_{J_n}g_0\|^2 \times \sum_{l=1}^{2^{J_n}} \int \phi_l(w)^2\,dw = o(2^{J_n}/n).$$






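Thus,

$$E_n[\phi^{J_n}(W_i)R_i] = E_n[\phi^{J_n}(W_i)U_i] + E[\phi^{J_n}(W)(g_0 - P_{J_n}g_0)(X)] + \mathrm{Rem},$$

with $\|\mathrm{Rem}\|_{\ell^2} = o_P(\sqrt{2^{J_n}/n})$. The second term on the right side is $O(\tau_{J_n}2^{-J_n s})$
in the Euclidean norm. Together with Lemma 5, we have

$$P\{\|E_n[\phi^{J_n}(W_i)R_i]\|_{\ell^2}^2 > D_1(\tau_{J_n}^2 2^{-2J_n s} + 2^{J_n}/n)\} \to 0.$$

Furthermore, by Lemma 1, we have

$$n^{-1}\mathrm{tr}(\hat\Phi_{WW}) \leq n^{-1}2^{J_n}s_{\max}(\hat\Phi_{WW}) = O_P(2^{J_n}/n).$$

Taking these together, we have

$$P[\|E_n[\phi^{J_n}(W_i)R_i]\|_{\ell^2}^2 + n^{-1}\mathrm{tr}(\hat\Phi_{WW}) \leq D_2(\tau_{J_n}^2 2^{-2J_n s} + 2^{J_n}/n)] \to 1.$$

By the proof of Theorem 1, this leads to the desired conclusion. □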



Appendix C: Proof of Theorem 3 

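For notational convenience, define

$$E_{\Pi_n}[\,\cdot \mid \mathcal{D}_n] := \int \cdot\ \Pi_n(dg \mid \mathcal{D}_n), \qquad E_{\tilde\Pi_n}[\,\cdot \mid \mathcal{D}_n] := \int \cdot\ \tilde\Pi_n(db^{J_n} \mid \mathcal{D}_n).$$

Proof of Theorem 3. Define the event

$$\mathcal{E}_{3n} := \{\mathcal{D}_n : \hat\Phi_{WX} \text{ and } \hat\Phi_{WW} \text{ are non-singular}\}.$$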

Then, by Lemma 1, P{1(£ 3 „) = 1} = P(£ 3 „) -> 1. Suppose that 1(£ 3 „) = 1. 
Then, by (13), i h .j n (£>„) defined in the proof of Proposition 5 is bounded from 
below by c|| b Jn \\ 2 2 +a term independent of b Jn for some positive random variable 
c. Hence, the integral E^ [||& J ™||^2 | D n ] is finite as soon as 1(£ 3 „) = 1. This 
proves the first assertion. 

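In what follows, we wish to prove the convergence rate result (11). First of
all, by the triangle inequality and Jensen's inequality,

$$1(\mathcal{E}_{3n})\|\hat g_{QB} - g_0\| \leq 1(\mathcal{E}_{3n})\|\hat g_{QB} - P_{J_n}g_0\| + \|g_0 - P_{J_n}g_0\|$$

$$= 1(\mathcal{E}_{3n})\|E_{\Pi_n}[g - P_{J_n}g_0 \mid \mathcal{D}_n]\| + \|g_0 - P_{J_n}g_0\|$$

$$= 1(\mathcal{E}_{3n})\|E_{\tilde\Pi_n}[b^{J_n} - b_0^{J_n} \mid \mathcal{D}_n]\|_{\ell^2} + \|g_0 - P_{J_n}g_0\|$$

$$\leq 1(\mathcal{E}_{3n})E_{\tilde\Pi_n}[\|b^{J_n} - b_0^{J_n}\|_{\ell^2} \mid \mathcal{D}_n] + \|g_0 - P_{J_n}g_0\|.$$

Since $\|g_0 - P_{J_n}g_0\| = O(2^{-J_n s})$, it suffices to show that there exists a constant
$M > 0$ such that

$$P\Big[1(\mathcal{E}_{3n})E_{\tilde\Pi_n}[\|b^{J_n} - b_0^{J_n}\|_{\ell^2} \mid \mathcal{D}_n] \leq M\max\{2^{-J_n s},\ 2^{J_n r}\sqrt{2^{J_n}/n},\ 2^{J_n r}\epsilon_n\varrho_n(\log n)^{1/2}\}\Big] \to 1.$$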
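Let $\pi_n^*(\theta^{J_n} \mid \mathcal{D}_n)$ be the (random) density defined in the proof of Theorem
1. Note that $\pi_n^*(\theta^{J_n} \mid \mathcal{D}_n)$ is well-defined as soon as $1(\mathcal{E}_{3n}) = 1$. Let
$\delta_n := \epsilon_n + \tau_{J_n}2^{-J_n s}$. Then we have:

Lemma 6. There exists a constant $c_1 > 0$ such that for every sequence $M_n \to \infty$
with $M_n = o(L_n)$,

$$P\Big[1(\mathcal{E}_{3n})\int \|\theta^{J_n}\|_{\ell^2}\,|\pi_n^*(\theta^{J_n} \mid \mathcal{D}_n) - dN(\Delta_n, \hat\Phi_{WW})(\theta^{J_n})|\,d\theta^{J_n} \leq e^{-c_1M_n^2n\delta_n^2} + M_n\sqrt{n}\delta_n\varrho_n\Big] \to 1,$$

where $\Delta_n := \sqrt{n}E_n[\phi^{J_n}(W_i)R_i]$.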

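We defer the proof of Lemma 6 until after the proof of this theorem. Here we have

$$\Big[1(\mathcal{E}_{3n})\int \|\theta^{J_n}\|_{\ell^2}\,dN(\Delta_n, \hat\Phi_{WW})(\theta^{J_n})\,d\theta^{J_n}\Big]^2 \leq 1(\mathcal{E}_{3n})\int \|\theta^{J_n}\|_{\ell^2}^2\,dN(\Delta_n, \hat\Phi_{WW})(\theta^{J_n})\,d\theta^{J_n}$$

$$\leq \|\Delta_n\|_{\ell^2}^2 + \mathrm{tr}(\hat\Phi_{WW}).$$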
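The bound above rests on two elementary facts: for a Gaussian vector with mean $\Delta$ and covariance $\Phi$, the second moment of the Euclidean norm equals $\|\Delta\|^2 + \mathrm{tr}(\Phi)$, and by Jensen's inequality the first moment is at most the square root of the second. The Monte Carlo sketch below illustrates both; the mean vector, the covariance matrix, and the function name are hypothetical values chosen for the example.

```python
import numpy as np


def gaussian_norm_moments(mean, cov, n_draws=100_000, seed=0):
    """Monte Carlo estimates of E||theta|| and E||theta||^2 for theta ~ N(mean, cov)."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(mean, cov, size=n_draws)
    norms = np.linalg.norm(draws, axis=1)
    return norms.mean(), (norms ** 2).mean()


if __name__ == "__main__":
    mean = np.array([1.0, 2.0, 0.5])       # illustrative Delta
    cov = np.diag([1.0, 0.5, 2.0])         # illustrative Phi
    first, second = gaussian_norm_moments(mean, cov)
    exact_second = mean @ mean + np.trace(cov)
    print("simulated E||theta||^2 :", second)
    print("||Delta||^2 + tr(Phi)  :", exact_second)
    print("simulated E||theta||   :", first, "<=", np.sqrt(second))
```

The simulated second moment should agree with the closed form up to Monte Carlo error, and the first-moment inequality holds exactly for the empirical distribution as well.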

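By the proof of Theorem 2, there exists a constant $D_1 > 0$ such that
$P\{\|\Delta_n\|_{\ell^2}^2 + \mathrm{tr}(\hat\Phi_{WW}) \leq D_1(n\tau_{J_n}^2 2^{-2J_n s} + 2^{J_n})\} \to 1$. Thus, for every sequence $M_n \to \infty$
with $M_n = o(L_n)$, with probability approaching one,

$$\sqrt{D_1(n\tau_{J_n}^2 2^{-2J_n s} + 2^{J_n})} + e^{-c_1M_n^2n\delta_n^2} + M_n\sqrt{n}\delta_n\varrho_n \geq 1(\mathcal{E}_{3n})\int \|\theta^{J_n}\|_{\ell^2}\,\pi_n^*(\theta^{J_n} \mid \mathcal{D}_n)\,d\theta^{J_n}$$

$$= 1(\mathcal{E}_{3n})\sqrt{n}\int \|\hat\Phi_{WX}(b^{J_n} - b_0^{J_n})\|_{\ell^2}\,\tilde\pi_n(b^{J_n} \mid \mathcal{D}_n)\,db^{J_n}$$

$$\geq 1(\mathcal{E}_{3n})\sqrt{n}\,s_{\min}(\hat\Phi_{WX})\,E_{\tilde\Pi_n}[\|b^{J_n} - b_0^{J_n}\|_{\ell^2} \mid \mathcal{D}_n].$$

Take $M_n \to \infty$ sufficiently slowly such that $M_n = O((\log n)^{1/2})$. Since the
left side is then $\lesssim \max\{\sqrt{n}\tau_{J_n}2^{-J_n s}, \sqrt{2^{J_n}}, \sqrt{n}\epsilon_n\varrho_n(\log n)^{1/2}\}$, there exists a
constant $D_2 > 0$ such that

$$P\Big[1(\mathcal{E}_{3n})s_{\min}(\hat\Phi_{WX})E_{\tilde\Pi_n}[\|b^{J_n} - b_0^{J_n}\|_{\ell^2} \mid \mathcal{D}_n] \leq D_2\max\{\tau_{J_n}2^{-J_n s},\ \sqrt{2^{J_n}/n},\ \epsilon_n\varrho_n(\log n)^{1/2}\}\Big] \to 1.$$

Finally, by Lemma 1, $P(s_{\min}(\hat\Phi_{WX}) \geq 0.5\tau_{J_n}) \to 1$, by which we have

$$P\Big[1(\mathcal{E}_{3n})E_{\tilde\Pi_n}[\|b^{J_n} - b_0^{J_n}\|_{\ell^2} \mid \mathcal{D}_n] \leq 2D_2\max\{2^{-J_n s},\ \tau_{J_n}^{-1}\sqrt{2^{J_n}/n},\ \tau_{J_n}^{-1}\epsilon_n\varrho_n(\log n)^{1/2}\}\Big] \to 1.$$

This leads to the desired conclusion. □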

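Proof of Lemma 6. As before, we say that a sequence of random variables $A_n$
is eventually bounded by another sequence of random variables $B_n$ if $P(A_n \leq B_n) \to 1$.

Take any $M_n \to \infty$ with $M_n = o(L_n)$. Then,

$$1(\mathcal{E}_{3n})\int \|\theta^{J_n}\|_{\ell^2}\,|\pi_n^*(\theta^{J_n} \mid \mathcal{D}_n) - dN(\Delta_n, \hat\Phi_{WW})(\theta^{J_n})|\,d\theta^{J_n}$$

$$\leq 1(\mathcal{E}_{3n})\int_{\|\theta^{J_n}\|_{\ell^2} \leq M_n\sqrt{n}\delta_n} \|\theta^{J_n}\|_{\ell^2}\,|\pi_n^*(\theta^{J_n} \mid \mathcal{D}_n) - dN(\Delta_n, \hat\Phi_{WW})(\theta^{J_n})|\,d\theta^{J_n}$$

$$+ 1(\mathcal{E}_{3n})\int_{\|\theta^{J_n}\|_{\ell^2} > M_n\sqrt{n}\delta_n} \|\theta^{J_n}\|_{\ell^2}\,\pi_n^*(\theta^{J_n} \mid \mathcal{D}_n)\,d\theta^{J_n}$$

$$+ 1(\mathcal{E}_{3n})\int_{\|\theta^{J_n}\|_{\ell^2} > M_n\sqrt{n}\delta_n} \|\theta^{J_n}\|_{\ell^2}\,dN(\Delta_n, \hat\Phi_{WW})(\theta^{J_n})\,d\theta^{J_n}$$

$$=: I + II + III.$$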

We divide the rest of the proof into three steps. 

Step 1. Claim: there exists a constant $c_2 > 0$ such that $P(II \leq e^{-c_2M_n^2n\delta_n^2}) \to 1$.

(Proof of Step 1): The assertion of Step 1 follows from the same line as in
the proof of Proposition 5 by noting that for any $c > 0$, $xe^{-cx^2} \leq e^{-cx^2/2}$ for all
$x > 0$ sufficiently large. Hence the proof is omitted.

Step 2. Claim: there exists a constant $c_3 > 0$ such that $P(III \leq e^{-c_3M_n^2n\delta_n^2}) \to 1$.

(Proof of Step 2): By the Cauchy-Schwarz inequality, the square of $III$ is
bounded by

$$\int \|\theta^{J_n}\|_{\ell^2}^2\,dN(\Delta_n, \hat\Phi_{WW})\,d\theta^{J_n} \times \int_{\|\theta^{J_n}\|_{\ell^2} > M_n\sqrt{n}\delta_n} dN(\Delta_n, \hat\Phi_{WW})\,d\theta^{J_n}.$$

Here the first integral is bounded by

$$\|\Delta_n\|_{\ell^2}^2 + \mathrm{tr}(\hat\Phi_{WW}),$$

which is eventually bounded by $D(n\tau_{J_n}^2 2^{-2J_n s} + 2^{J_n})$ for some constant $D > 0$
by the proof of Theorem 2. On the other hand, by the proof of Theorem 1, the
second integral is eventually bounded by a tail integral of the standard Gaussian
density $dN(0, I_{2^{J_n}})$ over the complement of a ball centered at the origin. By
Borell's inequality for Gaussian measures (see (20)), the last integral is bounded
by $e^{-c'M_n^2n\delta_n^2}$ for some small constant $c' > 0$. Taking these together, we obtain
the conclusion of Step 2 by choosing the constant $c_3 > 0$ sufficiently small.

Step 3. Claim: there exists a constant $c_4 > 0$ such that
$P(I \leq e^{-c_4M_n^2n\delta_n^2} + M_n\sqrt{n}\delta_n\varrho_n) \to 1$.

(Proof of Step 3): Let $C_n := \{\theta^{J_n} \in \mathbb{R}^{2^{J_n}} : \|\theta^{J_n}\|_{\ell^2} \leq M_n\sqrt{n}\delta_n\}$. Let
$\pi^*_{n,C_n}(\theta^{J_n} \mid \mathcal{D}_n)$ and $dN^{C_n}(\Delta_n, \hat\Phi_{WW})(\theta^{J_n})$ denote the probability densities
obtained by first restricting $\pi_n^*(\theta^{J_n} \mid \mathcal{D}_n)$ and $dN(\Delta_n, \hat\Phi_{WW})(\theta^{J_n})$ to the ball $C_n$
and then renormalizing, respectively. Then, abbreviating $\pi_n^*(\theta^{J_n} \mid \mathcal{D}_n)$ by $\pi_n^*$,
$\pi^*_{n,C_n}(\theta^{J_n} \mid \mathcal{D}_n)$ by $\pi^*_{n,C_n}$, $dN(\Delta_n, \hat\Phi_{WW})(\theta^{J_n})$ by $dN$, and $dN^{C_n}(\Delta_n, \hat\Phi_{WW})(\theta^{J_n})$
by $dN^{C_n}$, we have

$$I \leq 1(\mathcal{E}_{3n})\int \|\theta^{J_n}\|_{\ell^2}\,|\pi^*_{n,C_n} - dN^{C_n}| + 1(\mathcal{E}_{3n})\int_{\|\theta^{J_n}\|_{\ell^2} \leq M_n\sqrt{n}\delta_n} \|\theta^{J_n}\|_{\ell^2}\,|\pi^*_{n,C_n} - \pi_n^*|$$

$$+ 1(\mathcal{E}_{3n})\int_{\|\theta^{J_n}\|_{\ell^2} \leq M_n\sqrt{n}\delta_n} \|\theta^{J_n}\|_{\ell^2}\,|dN^{C_n} - dN| =: IV + V + VI.$$

By the proof of Theorem 1, the term $IV$ is eventually bounded by

$$1(\mathcal{E}_{3n})M_n\sqrt{n}\delta_n\int |\pi^*_{n,C_n} - dN^{C_n}| \leq M_n\sqrt{n}\delta_n\varrho_n.$$

For the term $V$, we have

$$V \leq 1(\mathcal{E}_{3n})M_n\sqrt{n}\delta_n\int_{\|\theta^{J_n}\|_{\ell^2} \leq M_n\sqrt{n}\delta_n} |\pi^*_{n,C_n} - \pi_n^*| \leq 1(\mathcal{E}_{3n})M_n\sqrt{n}\delta_n \times \frac{\int_{\|\theta^{J_n}\|_{\ell^2} > M_n\sqrt{n}\delta_n} \pi_n^*}{\int_{\|\theta^{J_n}\|_{\ell^2} \leq M_n\sqrt{n}\delta_n} \pi_n^*}.$$

By the proof of Proposition 5, there exists a constant $c_5 > 0$ such that the
ratio of the integrals on the right side is eventually bounded by $e^{-c_5M_n^2n\delta_n^2}$, so
that $P(V \leq e^{-c_5M_n^2n\delta_n^2/2}) \to 1$. Likewise, by Borell's inequality for Gaussian
measures, there exists a constant $c_6 > 0$ such that $P(VI \leq e^{-c_6M_n^2n\delta_n^2}) \to 1$.
Taking these together, we obtain the conclusion of Step 3 by choosing the
constant $c_4 > 0$ sufficiently small.

Finally, Steps 1-3 lead to the conclusion of Lemma 6. □



Appendix D: Proofs for Section 5 

Proof of Proposition 2. For either case of product or isotropic priors, it suffices
to check conditions P1) and P2) in Theorem 1. We shall do this with the choice
$\epsilon_n = \sqrt{2^{J_n}(\log n)/n} \sim (\log n)^{1/2}n^{-(r+s)/(2r+2s+1)}$.

Case of product priors: Let $c_{\min} := \min_{x \in [-A,A]} q(x) > 0$. Since
$\|b^{J_n} - b_0^{J_n}\|_{\ell^2}^2 = \sum_{l=1}^{2^{J_n}}(b_l - b_{0l})^2 \leq 2^{J_n}\max_{1 \leq l \leq 2^{J_n}}(b_l - b_{0l})^2$, we have

$$\tilde\Pi_n(b^{J_n} : \|b^{J_n} - b_0^{J_n}\|_{\ell^2} \leq \epsilon_n) \geq \tilde\Pi_n\Big(b^{J_n} : \max_{1 \leq l \leq 2^{J_n}} |b_l - b_{0l}| \leq \epsilon_n/\sqrt{2^{J_n}}\Big) \geq \prod_{l=1}^{2^{J_n}} \tilde\Pi_n(b^{J_n} : |b_l - b_{0l}| \leq \epsilon_n/\sqrt{2^{J_n}}).$$

Since $\exists \epsilon \in (0, A)$, $b_{0l} \in [-A + \epsilon, A - \epsilon]$ for all $l \geq 1$, for all $n$ sufficiently large,
the last expression is bounded from below by

$$(c_{\min}\epsilon_n/\sqrt{2^{J_n}})^{2^{J_n}} = e^{-2^{J_n}\log(\sqrt{2^{J_n}}/(c_{\min}\epsilon_n))} \geq e^{-Cn\epsilon_n^2},$$

where $C > 0$ is a sufficiently large constant, which verifies condition P1).

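Second, with this $\epsilon_n$, $\gamma_n$ in condition P2) is $\sim (\log n)^{1/2}n^{-s/(2r+2s+1)}$. Let,
say, $L_n \sim (\log n)^{1/2}$ so that $L_n\gamma_n \sim (\log n)n^{-s/(2r+2s+1)}$. Then,
$\{b^{J_n} : \|b^{J_n} - b_0^{J_n}\|_{\ell^2} \leq L_n\gamma_n\} \subset [-A, A]^{2^{J_n}}$ for all $n$ sufficiently large, so that
$\tilde\pi_n(b^{J_n}) = \prod_{l=1}^{2^{J_n}} q(b_l)$ is positive for all $\|b^{J_n} - b_0^{J_n}\|_{\ell^2} \leq L_n\gamma_n$. Let $\|b^{J_n}\|_{\ell^2} \leq L_n\gamma_n$ and
$\|\tilde b^{J_n}\|_{\ell^2} \leq L_n\gamma_n$. Then,

$$\frac{\tilde\pi_n(b_0^{J_n} + b^{J_n})}{\tilde\pi_n(b_0^{J_n} + \tilde b^{J_n})} = \exp\Big[\sum_{l=1}^{2^{J_n}} \{\log q(b_{0l} + b_l) - \log q(b_{0l} + \tilde b_l)\}\Big] \leq \exp\Big\{L\sum_{l=1}^{2^{J_n}} |b_l - \tilde b_l|\Big\}$$

$$\leq \exp\{L\sqrt{2^{J_n}}\|b^{J_n} - \tilde b^{J_n}\|_{\ell^2}\} \leq \exp\{2L\sqrt{2^{J_n}}L_n\gamma_n\} = 1 + o(1),$$

where the last step is due to $s > 1/2$. Likewise, we have

$$\frac{\tilde\pi_n(b_0^{J_n} + b^{J_n})}{\tilde\pi_n(b_0^{J_n} + \tilde b^{J_n})} \geq 1 - o(1).$$

Therefore, condition P2) is verified.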
Case of isotropic priors: Let c K 
sufficiently large, 



fi„(6 J " : ||6 J "-fe 7 "||p <e„) 



-o(l) 



[o,A] f"(a;) > 0. Then, for all n 



Ilf2i f 



r(||6 J "||^)d6 J - 



/r(||&*.||*)d&'» 
V^< e „KII& J " + &o1l^" 



> 



l|b J "ll f 2< e 



Cmin J\\ L.7-. 11 6^ 



J>(||^|!^ J 



ig[0,i 



-min r oo 



,2 J n -1. 



> C n 



-e2 J ™ log(2 J ") 



> e 



Cue 



-2 J * log(2 J "/e„)-c2 J ™ log(2 J ") 
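where $C > 0$ is a sufficiently large constant, which verifies condition P1).

Second, with this $\epsilon_n$, $\gamma_n$ in condition P2) is $\sim (\log n)^{1/2}n^{-s/(2r+2s+1)}$. Let
$L_n \sim (\log n)^{1/2}$ so that $L_n\gamma_n \sim (\log n)n^{-s/(2r+2s+1)}$. Since $\|b_0^{J_n}\|_{\ell^2} \leq \|g_0\| < A$
and $L_n\gamma_n \to 0$, $\{b^{J_n} : \|b^{J_n} - b_0^{J_n}\|_{\ell^2} \leq L_n\gamma_n\} \subset \{b^{J_n} : \|b^{J_n}\|_{\ell^2} \leq A\}$ for all $n$
sufficiently large, so that $\tilde\pi_n(b^{J_n}) \propto r(\|b^{J_n}\|_{\ell^2})$ is positive for all $\|b^{J_n} - b_0^{J_n}\|_{\ell^2} \leq L_n\gamma_n$.
Let $\|b^{J_n}\|_{\ell^2} \leq L_n\gamma_n$ and $\|\tilde b^{J_n}\|_{\ell^2} \leq L_n\gamma_n$. Then, by Bessel's inequality,

$$\|b_0^{J_n} + b^{J_n}\|_{\ell^2} \leq \|b_0^{J_n}\|_{\ell^2} + L_n\gamma_n \to \|g_0\|,$$

and likewise we have

$$\|b_0^{J_n} + b^{J_n}\|_{\ell^2} \geq \|b_0^{J_n}\|_{\ell^2} - L_n\gamma_n \to \|g_0\|.$$

Therefore, we conclude that

$$\frac{\tilde\pi_n(b_0^{J_n} + b^{J_n})}{\tilde\pi_n(b_0^{J_n} + \tilde b^{J_n})} = \frac{r(\|b_0^{J_n} + b^{J_n}\|_{\ell^2})}{r(\|b_0^{J_n} + \tilde b^{J_n}\|_{\ell^2})} \to \frac{r(\|g_0\|)}{r(\|g_0\|)} = 1$$

uniformly in $\|b^{J_n}\|_{\ell^2} \leq L_n\gamma_n$ and $\|\tilde b^{J_n}\|_{\ell^2} \leq L_n\gamma_n$, which verifies condition
P2). □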



Proof of Proposition 3. Given the proof of Proposition 2 and the discussion following
Theorem 3, it is sufficient to verify that $\varrho_n$ is $O((\log n)^{-1})$. However, this
is readily verified by tracking the proof of Proposition 2. □

Proof of Proposition 4. The proof is similar in spirit to that of Proposition 2.
Hence we only give a sketch of the proof.
Case (a): Condition P1) is verified with $\epsilon_n \sim \sqrt{2^{J_n}(\log n)/n}$. Then, $\gamma_n$ in P2)
is $\sim (\log n)^{1/2}n^{-s/(2r+2s+1)} \sim (\log n)^{1/2}2^{-J_n s}$. Because $\tilde\pi_n$ is constant on the
support, condition P2) is verified if the support of $\tilde\pi_n$ contains the ball
$\{b^{J_n} : \|b^{J_n} - b_0^{J_n}\|_{\ell^2} \leq L_n\gamma_n\}$ for all $n$ sufficiently large for some $L_n \to \infty$. Let, say,
$L_n \sim (\log n)^{1/4}$, so that $L_n\gamma_n \sim (\log n)^{3/4}2^{-J_n s}$. Since
$\{b^{J_n} : \|b^{J_n} - b_0^{J_n}\|_{\ell^2} \leq L_n\gamma_n\} \subset \{b^{J_n} : \max_{1 \leq l \leq 2^{J_n}} |b_l - b_{0l}| \leq L_n\gamma_n\}$, condition P2) is verified if
$L_n\gamma_n = o(A_n 2^{-J_n(s+1/2)})$. This is satisfied since $A_n 2^{-J_n(s+1/2)} \sim (\log n)2^{-J_n s}$.
The second assertion follows because in this case $\varrho_n = 0$ for all $n$ sufficiently
large.

Case (b): Condition P1) is verified with $\epsilon_n \sim \sqrt{2^{J_n}(\log n)/n}$ without significant
difficulty. Then, $\gamma_n$ in P2) is $\sim (\log n)^{1/2}n^{-s/(2r+2s+1)} \sim (\log n)^{1/2}2^{-J_n s}$.
Let, say, $L_n \sim (\log n)^{1/2}$. To establish the desired conclusion in this case, it is
sufficient to prove that

$$\log\frac{\tilde\pi_n(b_0^{J_n} + b^{J_n})}{\tilde\pi_n(b_0^{J_n} + \tilde b^{J_n})} = O((\log n)^{-1}),$$



uniformly in ||6 J ™||f2 < L n ln and ||6 "||^2 < L n ln- Let \\b J ' 
||6 J "||£2 < Lnln- Define at,..., o 2 j„ by a fe = 1 for fc = 1, 



p < L ri in and 
,2 J ° and a 2]+k = 
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2 j(*+i/2) for A; = 1, ... , 2 j ; j = J , . . . , J n - 1. Then, by construction, 



los 



2 ™ 



Observe that 



,2 2 J » 



oJ„(2 S +l)r2 2 



42 Z-( ' 1 — 42 ^ « — 42 

™ Z=l " ;=1 11 

2 J " 2 J " 2 J ™ 



(logn)" 1 , 



(£a?6 0i M 2 < (E a ?^)(E a ' 6 ') ^ JM J -(£a?6?) ^ ^( 2 ^ 2 )L 2 7; 



1=1 



where D > is constant depending only on || go |j 2, s,s- The second inequality 
leads to that 



2 J " 



1=1 



(log n) 1 . 



Therefore, we conclude that 



sup 

l|b J "ll f 2<i„7n,||b J "|l £ 2<£„7n 

This completes the proof. 



= 0((logn)- 1 ). 



□ 



Appendix E: Technical tools 

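We state here Rudelson's inequality for the reader's convenience.

Theorem 4 (Rudelson's (1999) inequality). Let $Z_1, \dots, Z_n$ be i.i.d. random
vectors in $\mathbb{R}^k$ with $\Sigma := E[Z_1^{\otimes 2}]$. Then, for all $k \geq e^2$,

$$E\Big[\Big\|\frac{1}{n}\sum_{i=1}^n Z_i^{\otimes 2} - \Sigma\Big\|_{op}\Big] \leq \max\{\|\Sigma\|_{op}^{1/2}\delta, \delta^2\}, \qquad \delta = D\sqrt{\frac{\log k}{n}E\Big[\max_{1 \leq i \leq n}\|Z_i\|_{\ell^2}^2\Big]},$$

where $D$ is a universal constant.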

Rudelson's inequality implies the following corollary useful in our application.

Corollary 1. Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be i.i.d. random vectors with $X_i \in \mathbb{R}^{k_1}$,
$Y_i \in \mathbb{R}^{k_2}$, and $k_1 + k_2 \geq e^2$. Let $\Sigma_X := E[X_1^{\otimes 2}]$, $\Sigma_Y := E[Y_1^{\otimes 2}]$ and $\Sigma_{XY} := E[X_1Y_1^T]$.
Suppose that there exists a finite number $m$ such that
$E[\max_{1 \leq i \leq n}\|X_i\|_{\ell^2}^2] \vee E[\max_{1 \leq i \leq n}\|Y_i\|_{\ell^2}^2] \leq m$. Then,

$$E\Big[\Big\|\frac{1}{n}\sum_{i=1}^n X_iY_i^T - \Sigma_{XY}\Big\|_{op}\Big] \leq \max\{(\|\Sigma_X\|_{op}^{1/2} \vee \|\Sigma_Y\|_{op}^{1/2})\delta, \delta^2\} \quad \text{with} \quad \delta = D\sqrt{\frac{m\log(k_1 \vee k_2)}{n}},$$

where $D$ is a universal constant.

Proof. Let $Z_i = (X_i^T, Y_i^T)^T$, and apply Rudelson's inequality to $Z_1, \dots, Z_n$.
Note that by the variational characterization of the operator norm, we have
$\|n^{-1}\sum_{i=1}^n X_iY_i^T - \Sigma_{XY}\|_{op} \leq \|n^{-1}\sum_{i=1}^n Z_i^{\otimes 2} - E[Z_1^{\otimes 2}]\|_{op}$, and by the Cauchy-Schwarz
inequality, $\|E[Z_1^{\otimes 2}]\|_{op} \leq 2\|\Sigma_X\|_{op} + 2\|\Sigma_Y\|_{op}$. □
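The first step of the proof of Corollary 1 uses the fact that the operator norm of an off-diagonal block never exceeds the operator norm of the full matrix. A minimal numerical check of this fact is sketched below; the random symmetric matrix is an arbitrary stand-in for the centered second-moment matrix of the stacked vectors, and the function name and dimensions are hypothetical.

```python
import numpy as np


def block_norms(k1=3, k2=4, seed=0):
    """Return (||B||_op, ||M||_op) for a random symmetric M with off-diagonal block B."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((k1 + k2, k1 + k2))
    M = G + G.T                 # symmetric stand-in for the centered matrix of Z Z^T
    B = M[:k1, k1:]             # block playing the role of the X Y^T part
    return np.linalg.norm(B, 2), np.linalg.norm(M, 2)


if __name__ == "__main__":
    off_diag, full = block_norms()
    print("off-diagonal block norm:", off_diag)
    print("full matrix norm       :", full)
```

The inequality follows from the variational characterization: restricting the unit vectors in $\sup_{u,v} u^TMv$ to those supported on the first $k_1$ and last $k_2$ coordinates, respectively, can only decrease the supremum.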

Finally, we recall Talagrand's (1996) celebrated concentration inequality for
general empirical processes. The following version is due to Massart (2000). Here,
for a generic class $\mathcal{F}$ of measurable functions on some measurable space $\mathcal{X}$, we
say that $\mathcal{F}$ is pointwise measurable if there exists a countable class $\mathcal{G}$ of measurable
functions on $\mathcal{X}$ such that for any $f \in \mathcal{F}$, there exists a sequence $\{g_m\} \subset \mathcal{G}$
with $g_m(x) \to f(x)$ for all $x \in \mathcal{X}$. See Chapter 2.3 of van der Vaart and Wellner
(1996).

Theorem 5 (Massart's form of Talagrand's inequality). Let $\xi_i$, $i = 1, 2, \dots, n$
be i.i.d. random variables taking values in some measurable space $\mathcal{X}$. Let $\mathcal{F}$ be a
pointwise measurable class of functions on $\mathcal{X}$ such that $E[f(\xi_1)] = 0$ for all $f \in \mathcal{F}$
and $\sup_{f \in \mathcal{F}}\sup_{x \in \mathcal{X}} |f(x)| \leq B$ for some constant $B > 0$. Let $\sigma^2$ be any positive
constant such that $\sigma^2 \geq \sup_{f \in \mathcal{F}} E[f^2(\xi_1)]$. Let $Z := \sup_{f \in \mathcal{F}} |\sum_{i=1}^n f(\xi_i)|$. Then,
for all $x > 0$, we have

$$P\{Z \geq C(E[Z] + \sigma\sqrt{nx} + Bx)\} \leq e^{-x},$$

where $C > 0$ is a universal constant.
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